A database is a collection of structured data stored in a computer system, and it can be hosted on-premises or in the cloud. Because databases are designed to make data easy to access, the resources compiled here cover everything you need to know, from database management systems to database languages.
A headless cluster in an Amazon Aurora global database refers to a cluster that has no database instance attached to it. In other words, the cluster only includes the database storage and is used as a backup in case of a failure of the primary cluster. The primary cluster is the one that has a database instance attached to it and is serving the application's read and write requests. The headless secondary cluster is used for disaster recovery and to provide high availability for the database. The data in the primary cluster is automatically replicated to the headless secondary cluster, so in case of a failure, the secondary cluster can take over as the primary cluster and provide a seamless transition with minimal disruption. In this article, we will see how to perform a failover with a headless Aurora global database and examine its outcomes. Steps To Create an Aurora Global Database I created an RDS database (named "testaurocluster") and then created a global cluster from it with the AWS CLI using the following commands: Create a Global Cluster Shell aws rds create-global-cluster --region us-east-1 \ --global-cluster-identifier global-cluster \ --source-db-cluster-identifier arn:aws:rds:us-east-1:<AWSAccountId>:cluster:testaurocluster Create a Headless Secondary Cluster Shell aws rds create-db-cluster \ --region us-west-1 \ --source-region us-east-1 \ --db-cluster-identifier testaurocluster \ --global-cluster-identifier global-cluster \ --engine aurora-mysql \ --engine-version 5.7.mysql_aurora.2.10.2 \ --kms-key-id <key value> KMS Key Make sure you use the correct KMS key before running the above command. (The KMS key should belong to the region where you are going to create the secondary cluster, i.e., us-west-1.) CloudWatch Metric Health Check Create a CloudWatch metric alarm for health checks: Log in to the Amazon Web Services (AWS) Management Console. Navigate to the RDS service. Select the RDS cluster you want to monitor. Click on the "Metrics" tab. Click on the "Create Alarm" button. Select the "RDS" namespace and the "DBInstanceIdentifier" dimension. Choose the metric (e.g., CPUUtilization). Set the alarm threshold (e.g., greater than or equal to 75%). Choose the desired alarm actions (e.g., sending an SNS notification). Give the alarm a name and description (optional). Click the "Create Alarm" button. Your CloudWatch metric for CPU utilization of the RDS cluster is now set up. The alarm will trigger the specified actions if the CPU utilization exceeds the threshold. Health Check Using Route 53 Log in to the Amazon Route 53 console. Choose "Health checks" from the navigation panel. Choose "Create health check." Give your health check a name and choose "CloudWatch Alarm" as the type. Choose the appropriate alarm from the list. The alarm should reflect the health of your RDS cluster. Choose "Create." Once the health check has been created, you can use it in a Route 53 record set to route traffic to your RDS cluster. This allows Route 53 to monitor the state of the CloudWatch alarm and redirect traffic to a healthy endpoint if one of your instances becomes unavailable. The state of the CloudWatch alarm should reflect the health of your RDS cluster, so you will need to set up the alarm appropriately in CloudWatch. Route 53 Hosted Zone Create a Route 53 hosted zone: Open the Route 53 management console in the AWS Management Console. In the navigation pane, choose "Hosted zones." Choose "Create Hosted Zone." In the "Create Hosted Zone" dialog box, enter a domain name for the hosted zone.
Choose "Create" Create a DNS Record for the DB Cluster: In the Route 53 management console, select the hosted zone that you created in Step 2. Choose "Create Record. " Add the record’s name. In the "Create Record " dialog box, set the type of record to "A" or "CNAME," depending on your use case. Check the alias button and fill in the option as required. Choose the endpoint and enter the endpoint of the cluster in the field. Select failover as a route policy and choose failover record type as primary. Add the health check ID that was created in the previous step. Give a name to record the ID. Click on add record button and add one more record for a secondary cluster. Choose "Create" Pre-Failover Step Before proceeding with failover, you must add a DB instance to an existing headless cluster in the secondary region. After a failover, the headless secondary cluster becomes the primary cluster and takes over the role of serving the read and write requests from the application. It needs to configure the new database instance with the necessary settings, such as the database engine type, storage type, and instance size, and then connect it to your application. It is important to have a well-defined disaster recovery plan in place to ensure that you can quickly and effectively respond to a failover event. Using the AWS CLI command to add a DB instance to an existing headless cluster in the secondary region: Textile aws rds create-db-instance \ --db-instance-identifier testaurocluster-instance-1 \ --db-cluster-identifier testaurocluster\ --db-instance-class db.r5.large \ --engine aurora-mysql \ --region us-west-1 CloudWatch Dashboard Create a CloudWatch dashboard to check the metrics over time and quickly see any changes or trends in performance after deleting the read instance. Log in to the AWS Management Console. Navigate to the CloudWatch service. Click on "Dashboards" in the left navigation panel. Click on the "Create dashboard" button. Give the dashboard a name and choose a preferred layout. Click on the "Add widget" button to add the relevant metrics to the dashboard. Select the metrics you want to add and set the desired time range for them. Customize the appearance of the widget if desired, such as adding a title, axis labels, and legend. Click on the "Save dashboard" button to save the dashboard. Failover Process Steps Navigate to the RDS Console and select the target global database for which you want to perform a failover. Initiate a failover. You will be prompted to select a target instance to which to perform the failover. After the failover is initiated, you can verify that it has been completed by checking the status of the target instance in the RDS Console. The target instance should now be in the "available" state, and the original primary instance should now be in the "failover" state. Monitor the failover: After the failover is complete, it is important to monitor the performance of the new primary instance to ensure that it is serving traffic as expected. This can be done through Amazon CloudWatch, or by monitoring the performance of your application. Failover Time It took a few minutes to complete the failover process (time can be tracked using RDS event logs). However, the failover of a global Amazon Aurora cluster can take anywhere from a few minutes to several hours, depending on the following factors. Network latency (the time it takes to redirect traffic from the primary cluster to the secondary cluster will depend on the network latency between the two clusters). 
Size of the database (larger databases will take longer to fail over than smaller databases). Performance of the secondary cluster (a slower secondary cluster will take longer to start serving traffic than a faster one). Application requirements (specific requirements of your application, such as the amount of data that needs to be recovered or the number of updates that need to be made to your DNS records). In the secondary (previously headless) Amazon Aurora database cluster, there is only one instance. When the secondary cluster becomes the primary cluster after a failover, this single instance acts as both the reader and the writer instance. It serves read requests from the application and applies updates to the database, making it the new primary writer instance. It's important to note that in Amazon Aurora, the reader and writer instances are separate and can be managed independently, even in the same cluster. However, in the case of a headless secondary cluster, there is only one instance that acts as both the reader and the writer. This single instance is designed to provide a cost-effective solution for disaster recovery, as it eliminates the need for a separate reader instance in the secondary cluster. Conclusion In conclusion, performing a failover with a headless Aurora global database can provide significant benefits for achieving cost-effective multi-region resiliency.
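As a reference, the detach-and-promote step behind this kind of disaster-recovery failover can also be scripted with the AWS CLI instead of the console. The sketch below is hedged: it uses the identifiers from this walkthrough and assumes that removing the secondary cluster from the global cluster promotes it to a standalone, writable cluster, so verify the exact behavior against the current AWS documentation before relying on it.
Shell
# Detach the (formerly headless) secondary cluster from the global cluster so it can be
# promoted and start accepting writes in us-west-1. The --db-cluster-identifier value is
# the secondary cluster's ARN.
aws rds remove-from-global-cluster \
  --region us-west-1 \
  --global-cluster-identifier global-cluster \
  --db-cluster-identifier arn:aws:rds:us-west-1:<AWSAccountId>:cluster:testaurocluster

# Poll the promoted cluster until it reports "available".
aws rds describe-db-clusters \
  --region us-west-1 \
  --db-cluster-identifier testaurocluster \
  --query "DBClusters[0].Status"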
In this article, we'll test the ability of SingleStoreDB to store and query the OpenAI Wikipedia Vector Database dataset. We'll see that SingleStoreDB can manage this dataset with ease. SingleStoreDB has supported a range of vector functions for some time, and these functions are ideally suited for modern applications using GPT technology. The notebook file used in this article is available on GitHub. Introduction In several previous articles, we have used some of the vector capabilities built into SingleStoreDB: Quick tip: SingleStoreDB's EUCLIDEAN_DISTANCE and JSON_ARRAY_PACK functions Using SingleStoreDB, Spark, and Alternating Least Squares (ALS) to build a Movie Recommender System In this article, we'll test the JSON_ARRAY_PACK and DOT_PRODUCT vector functions with the OpenAI Wikipedia Vector Database dataset. There is an OpenAI notebook available on GitHub under an MIT License that tests several Vector Database systems. The tests can be run using local clients or in the cloud. In this article, we'll use a local installation of SingleStoreDB. Install SingleStoreDB In this article, we'll install SingleStoreDB in a Virtual Machine (VM) environment. It takes just a few minutes. A previous article described the steps. Alternatively, we could use a Docker image. For this article, we'll only need two tarball files for the VM installation: singlestoredb-toolbox singlestoredb-server Assuming a two-node cluster was correctly deployed and using the same variable names from the previous article, we can connect to our cluster from a MySQL CLI Client as follows: Shell mysql -u root -h ubuntu -P 3306 --default-auth=mysql_native_password -p Once connected to our cluster, we'll create a new database as follows: SQL CREATE DATABASE IF NOT EXISTS openai_demo; Install Jupyter From the command line, we'll install the classic Jupyter Notebook as follows: Shell pip install notebook OpenAI API Key Before launching Jupyter, we must create an account on the OpenAI website. This provides some free credits. Since we will use embeddings, the cost will be minimal. We'll also need to create an OpenAI API Key. This can be created from USER > API keys in our OpenAI account. From the command line, we'll export the OPENAI_API_KEY variable in our environment, as follows: Shell export OPENAI_API_KEY=<OpenAI API Key> Replace <OpenAI API Key> with your key. Next, we'll launch Jupyter as follows: Shell jupyter notebook Fill Out the Notebook Let's now create a new notebook. We'll adhere to the flow and structure of the OpenAI notebook and use some small code sections directly from the notebook, where required. 
Setup First, some libraries: Python !pip install matplotlib !pip install openai !pip install plotly.express !pip install scikit-learn !pip install singlestoredb !pip install tabulate !pip install wget Next, some imports: Python import openai import pandas as pd import os import wget from ast import literal_eval Then, the embedding model: Python EMBEDDING_MODEL = "text-embedding-ada-002" Load Data We'll now obtain the Wikipedia dataset: Python embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip' # The file is ~700 MB so this will take some time wget.download(embeddings_url) And unpack it: Python import zipfile with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref: zip_ref.extractall("data") Next, we'll load the file into a Pandas Dataframe: Python article_df = pd.read_csv( "data/vector_database_wikipedia_articles_embedded.csv" ) And we'll take a look at the first few lines as follows: Python article_df.head() The next operation from the OpenAI notebook can take a while: Python # Read vectors from strings back into a list article_df['title_vector'] = article_df.title_vector.apply(literal_eval) article_df['content_vector'] = article_df.content_vector.apply(literal_eval) # Set vector_id to be a string article_df['vector_id'] = article_df['vector_id'].apply(str) Next, we'll look at the dataframe info: Python article_df.info(show_counts=True) The result should be as follows: Plain Text <class 'pandas.core.frame.DataFrame'> RangeIndex: 25000 entries, 0 to 24999 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 25000 non-null int64 1 url 25000 non-null object 2 title 25000 non-null object 3 text 25000 non-null object 4 title_vector 25000 non-null object 5 content_vector 25000 non-null object 6 vector_id 25000 non-null object dtypes: int64(1), object(6) Create Table Let's now create a connection to our local installation of SingleStoreDB: Python import singlestoredb as s2 conn = s2.connect("root:<password>@<host>:3306/openai_demo") cur = conn.cursor() We'll replace the values for <password> and <host> with the values that we used earlier at installation time. We'll now create a table as follows: Python stmt = """ CREATE TABLE IF NOT EXISTS wikipedia ( id INT PRIMARY KEY, url VARCHAR(255), title VARCHAR(100), text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci, title_vector BLOB, content_vector BLOB, vector_id INT ) """ cur.execute(stmt) Populate Table We can populate our database table as follows: Python # Prepare the statement stmt = """ INSERT INTO wikipedia ( id, url, title, text, title_vector, content_vector, vector_id ) VALUES ( %s, %s, %s, %s, JSON_ARRAY_PACK_F64(%s), JSON_ARRAY_PACK_F64(%s), %s ) """ # Convert the DataFrame to a NumPy record array record_arr = article_df.to_records(index=False) # Set the batch size batch_size = 1000 # Iterate over the rows of the record array in batches for i in range(0, len(record_arr), batch_size): batch = record_arr[i:i+batch_size] values = [( row[0], row[1], row[2], row[3], str(row[4]), str(row[5]), int(row[6]) ) for row in batch] cur.executemany(stmt, values) We can also use JSON_ARRAY_PACK_F32 (32-bit, IEEE standard format), instead of JSON_ARRAY_PACK_F64 (64-bit, IEEE standard format). Loading the data should take a short time. We can use other data loading methods, such as pipelines, for larger datasets. 
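As a quick sanity check after the load (not part of the original OpenAI notebook), we can confirm that all 25,000 rows from the dataframe made it into the table. A minimal sketch using the same cursor as above:
Python
# Count the rows in the wikipedia table; the dataframe had 25,000 entries,
# so we expect the same number here.
cur.execute("SELECT COUNT(*) FROM wikipedia")
row_count = cur.fetchone()[0]
print(f"Rows loaded: {row_count}")  # expected: 25000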
Search Data First, we'll import the following: Python from openai.embeddings_utils import get_embedding Next, we'll check that the OPENAI_API_KEY can be read, as follows: Python if os.getenv("OPENAI_API_KEY") is not None: openai.api_key = os.getenv("OPENAI_API_KEY") print ("OPENAI_API_KEY is ready") else: print ("OPENAI_API_KEY environment variable not found") We'll now define a Python function that will allow us to use either of the two vector columns in the database: Python from typing import Tuple, List def search_wikipedia( query: str, column1: str, column2: str, num_rows: int = 10 ) -> Tuple[List[str], List[float]]: """Searches Wikipedia for the given query and returns the top `num_rows` results. Args: query: The query to search for. column1: The name of the column in the Wikipedia database to return for each result. column2: The name of the column in the Wikipedia database to use as the score for each result. num_rows: The number of results to return. Returns: A list of the top `num_rows` results. """ # Get the embedding of the query embedding = get_embedding(query, EMBEDDING_MODEL) # Create the SQL statement stmt = f""" SELECT {column1}, DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(%s), {column2}) AS score FROM wikipedia ORDER BY score DESC LIMIT %s """.format(column1=column1, column2=column2) # Execute the SQL statement cur.execute(stmt, [str(embedding), num_rows]) # Get the results results = cur.fetchall() # Separate the results into two lists values = [row[0] for row in results] scores = [row[1] for row in results] # Return the results return values, scores We can test SingleStoreDB using the two examples in the OpenAI notebook. First, we'll use title and title_vector: Python values1, scores1 = search_wikipedia( query = "modern art in Europe", column1 = "title", column2 = "title_vector", num_rows = 5 ) We'll format the results using the following: Python from tabulate import tabulate # Combine the values and scores lists into a list of tuples # Each tuple contains a value and its corresponding score table_data1 = list(zip(values1, scores1)) # Add a rank column to the table data table_data1 = [(i + 1,) + data for i, data in enumerate(table_data1)] # Create the table table1 = tabulate(table_data1, headers=["Rank", "Title", "Score"]) # Print the table print(table1) The output should be similar to the following: Plain Text Rank Title Score ------ -------------------- -------- 1 Museum of Modern Art 0.875081 2 Western Europe 0.867523 3 Renaissance art 0.864172 4 Pop art 0.860346 5 Northern Europe 0.854755 Next, we'll use text and content_vector: Python values2, scores2 = search_wikipedia( query = "Famous battles in Scottish history", column1 = "text", column2 = "content_vector", num_rows = 5 ) We'll format the results using the following: Python # Combine the values and scores lists into a list of tuples # Each tuple contains a value and its corresponding score table_data2 = list(zip([value[:50] for value in values2], scores2)) # Add a rank column to the table data table_data2 = [(i + 1,) + data for i, data in enumerate(table_data2)] # Create the table table2 = tabulate(table_data2, headers=["Rank", "Text", "Score"]) # Print the table print(table2) The output should be similar to the following: Plain Text Rank Text Score ------ -------------------------------------------------- -------- 1 The Battle of Bannockburn, fought on 23 and 24 Jun 0.869338 2 The Wars of Scottish Independence were a series of 0.86148 3 Events 0.852533 January 1 – Charles II crowned King of 4 The First War of Scottish 
Independence lasted from 0.849642 5 Robert I of Scotland (11 July 1274 – 7 June 1329) 0.846184 Summary In this article, we've seen that SingleStoreDB can store vectors with ease and that we can also store other data types, such as numbers and text, in the same table. With its powerful SQL and multi-model support, SingleStoreDB provides a one-stop solution for modern applications, bringing both technical and business benefits through a single product.
Scalability and low latency are crucial for any application that relies on real-time data. One way to achieve this is by storing data closer to the users. In this post, we'll discuss how you can use YugabyteDB and its read replica nodes to improve read latency for users across the globe. Let's walk through a practical example using an application that streams market orders to see how you can achieve low latency for reads regardless of a user's location. YugabyteDB Managed Cluster Let's say you have a YugabyteDB Managed cluster consisting of three nodes running in the United States. Next, assume the following: You start the application that connects to this cluster and streams market orders into the database. A user from the United States opens a database connection and queries for the most popular stocks. In this case, the latency will be around 20 milliseconds. However, if a different user from Southeast Asia sends the same query, the latency will be much higher, around 190 milliseconds. The increased latency for users in distant locations, such as Southeast Asia, can negatively impact their experience. One way to tackle this issue is to store data closer to the users with read replicas. Deploying a Read Replica Read replicas store a full or partial copy of the primary database. Directing read queries from users to the nearest read replica makes it possible to improve read latency significantly. Let's add a replica node to the USA-based YugabyteDB Managed cluster to demonstrate how replicas work in practice. The replica will be added to the asia-east1 region. Follow these steps to set up your replica node: Navigate to your YugabyteDB Managed cluster dashboard. Click on "Configure Read Replicas." Select the "asia-east1" region. Confirm the configuration. The replica node in Southeast Asia will be ready within a few minutes. Once the node becomes active in the specified region, it synchronizes with the main database to keep the data consistent. Connecting Users to Nearby Replica Nodes The next step is to connect the users from Southeast Asia to the newly deployed replica node in the asia-east1 region. Once that's done, it's time to compare the latencies. Now, when users from Southeast Asia query for the most popular stocks again, the latency drops dramatically to 20 milliseconds, similar to the latency for the user from the United States. As a result, users from both locations will have a comparably good user experience thanks to the improved latency. Summary Read replicas can significantly improve read latency for users in distant locations. By deploying replica nodes in different regions and connecting users to their nearest replica, you can deliver a consistent experience for all of your users, regardless of where they are located.
In this blog, you will learn how to backup and restore a PostgreSQL database. Enjoy! 1. Introduction Some time ago, I needed to back up a PostgreSQL database from a production server in order to be able to fix a problem that was difficult to reproduce in the test environment. It appeared that I could not find the answer very quickly by means of a Google search. After a while, I managed to find out the commands I needed to use, and it seemed to me a good idea to share this knowledge. In order to test the commands, I will be using a database I created for a previous blog. The schema is quite basic and is shown here for information purposes only. It is not relevant for the remainder of the post. The database is called myjhipsterplanet and that is the one you will back up and restore in this post. 2. Prerequisites Prerequisites needed for this blog: Ubuntu 22.04 is used; Basic knowledge of Docker and Docker Compose is needed; Basic PostgreSQL knowledge is needed. PostgreSQL 14.5 is used. 3. Preparation In this section, a database is prepared, which you will use to backup and restore. Skip this section if you just want to know how to use the backup and restore commands. The database backup and Docker Compose file used in this section are available at GitHub. The Docker Compose file will start a PostgreSQL database containing a basic setup where the default admin user postgres is replaced with a user mypostgresqldumpplanet. The PostgreSQL data is mounted to the directory ~/docker/volumes/postgresqldata. YAML version: '3.8' services: mypostgresqldumpplanet-postgresql: image: postgres:14.5 volumes: - ~/docker/volumes/postgresqldata/:/var/lib/postgresql/data/ environment: - POSTGRES_USER=mypostgresqldumpplanet - POSTGRES_PASSWORD= - POSTGRES_HOST_AUTH_METHOD=trust ports: - 127.0.0.1:5432:5432 Execute the following command from the root of the repository in order to start a PostgreSQL Docker container. Shell $ docker compose -f postgresql.yml up -d [+] Running 2/2 Network mypostgresqldumpplanet_default Created 0.1s Container mypostgresqldumpplanet-mypostgresqldumpplanet-postgresql-1 Started 0.4s Copy the database dump file from the root of the repository to the mounted PostgreSQL directory. Shell $ sudo cp 2023-04-01-original.dump ~/docker/volumes/postgresqldata/ Now you need to open a bash shell inside the container. This can be done with the docker exec command. Inside the container, the Linux user postgres has to be used; therefore, the -u argument needs to be passed. Shell $ docker exec -it -u postgres mypostgresqldumpplanet-mypostgresqldumpplanet-postgresql-1 bash postgres@4934dd659171:/$ Now you have access to bash inside the container. You need to restore the database from the dump by means of pg_restore. Shell $ pg_restore -U mypostgresqldumpplanet -d postgres -C /var/lib/postgresql/data/2023-04-01-original.dump pg_restore: while PROCESSING TOC: pg_restore: from TOC entry 3382; 1262 16384 DATABASE myjhipsterplanet myjhipsterplanet pg_restore: error: could not execute query: ERROR: role "myjhipsterplanet" does not exist Command was: ALTER DATABASE myjhipsterplanet OWNER TO myjhipsterplanet; ... The parameters used are explained: -U mypostgresqldumpplanet: the admin database user. -d postgres: you need to provide a database that pg_restore can connect to. Because this is an empty instance of PostgreSQL, you use the postgres database, which is always available with the -d argument. 
-C: Because you have an empty PostgreSQL database, you need to provide the argument -C in order that pg_restore (the backup restore tool) will create the database for you. /var/lib/postgresql/data/2023-04-01-original.dmp: The file location of the database dump. The database dump is accessible via the mount in the directory /var/lib/postgresql/data/. As you can see in the output, the restore was not completely successful because of a missing role. Enter the database, connect to the database postgres with the user mypostgresqldumpplanet. Shell $ psql -d postgres -U mypostgresqldumpplanet Create the missing role. Shell postgres=# CREATE ROLE myjhipsterplanet; CREATE ROLE Check which databases exist with the \l command. Shell postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges ------------------------+------------------------+----------+------------+------------+--------------------------------------------------- myjhipsterplanet | myjhipsterplanet | UTF8 | en_US.utf8 | en_US.utf8 | mypostgresqldumpplanet | mypostgresqldumpplanet | UTF8 | en_US.utf8 | en_US.utf8 | postgres | mypostgresqldumpplanet | UTF8 | en_US.utf8 | en_US.utf8 | template0 | mypostgresqldumpplanet | UTF8 | en_US.utf8 | en_US.utf8 | =c/mypostgresqldumpplanet + | | | | | mypostgresqldumpplanet=CTc/mypostgresqldumpplanet template1 | mypostgresqldumpplanet | UTF8 | en_US.utf8 | en_US.utf8 | =c/mypostgresqldumpplanet + | | | | | mypostgresqldumpplanet=CTc/mypostgresqldumpplanet (5 rows) Note that the database myjhipsterplanet is created. However, not everything was created correctly, so drop the database. Shell postgres=# DROP DATABASE myjhipsterplanet; DROP DATABASE Use the quit command to exit the postgres prompt. Try the restore command again, and this time the restore is successful. Connect via psql to the database. Use the \c command to use the myjhipsterplanet database. Shell postgres=# \c myjhipsterplanet You are now connected to database "myjhipsterplanet" as user "mypostgresqldumpplanet". Verify that the customer table contains data. Shell myjhipsterplanet=# select * from customer; id | customer_name | company_id ----+-------------------------------+------------ 1 | Assistant | 2 | Market | 3 | program payment User-friendly | 4 | Team-oriented | 5 | index | 6 | auxiliary experiences | 7 | Sahara | 8 | Kong deposit | 9 | indexing Ball | 10 | users | (10 rows) In order to be able to verify the database will be restored correctly, later on, you change one of the customer records. Shell myjhipsterplanet=# UPDATE customer SET customer_name = 'Assistent1' WHERE id = 1; 4. Backup Database In order to backup the database, pg_dump is used. The official documentation of pg_dump can be found at the PostgreSQL website. Also, help information can be retrieved with the following command: Shell postgres@4934dd659171:/$ pg_dump --help If you are running PostgreSQL inside a Docker container, you need to have access to a bash shell inside the container. First, you need to retrieve the name of the Docker container. 
Shell $ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 9dea76770a2e postgres:14.5 "docker-entrypoint.s…" 5 seconds ago Up 4 seconds 127.0.0.1:5432->5432/tcp mypostgresqldumpplanet-mypostgresqldumpplanet-postgresql-1 Open a bash shell inside the container, as described above: Shell $ docker exec -it -u postgres mypostgresqldumpplanet-mypostgresqldumpplanet-postgresql-1 bash postgres@4934dd659171:/$ There are four formats for creating a backup, and each of them is described in the next sections. 4.1 Plain Text SQL Backup By default, pg_dump will create a plain text SQL backup. Do not use this default if you want to use the backup to restore with pg_restore. Shell postgres@4934dd659171:/$ pg_dump -f /var/lib/postgresql/data/2023-04-01-plaintext.sql -U mypostgresqldumpplanet myjhipsterplanet The parameters explained: -f /var/lib/postgresql/data/2023-04-01-plaintext.sql: The name of the backup file; -U mypostgresqldumpplanet: The name of the PostgreSQL admin user, often this will be postgres; myjhipsterplanet: The name of the database you want to back up. 4.2 Custom Backup The PostgreSQL custom backup format. Compressed by default and most likely the best option to use for creating the backup. Shell postgres@4934dd659171:/$ pg_dump -F c -f /var/lib/postgresql/data/2023-04-01-custom.dump -U mypostgresqldumpplanet myjhipsterplanet The arguments are similar to those for the plain text SQL backup; only here, you need to add the parameter -F c in order to choose the custom backup format. 4.3 Directory Backup The directory backup format. Compressed by default and creates a directory with one file for each table and blob being dumped. This format also supports parallel dumps. Shell postgres@4934dd659171:/$ pg_dump -F d -f /var/lib/postgresql/data/2023-04-01-directory -U mypostgresqldumpplanet myjhipsterplanet The arguments are similar to those for the plain text SQL backup; only here, you need to add the parameter -F d in order to choose the directory backup format. Some websites explaining how to create PostgreSQL backups do not use the -f parameter for specifying the output file. Instead, they redirect the output of the pg_dump command to a file with the greater-than sign (>). However, this will not work with the directory backup format, as you will encounter the following error: Shell postgres@4934dd659171:/$ pg_dump -F d -U mypostgresqldumpplanet myjhipsterplanet > /var/lib/postgresql/data/2023-04-01-directory/ bash: /var/lib/postgresql/data/2023-04-01-directory/: Is a directory 4.4 Tar Backup The tar backup format. Does not support compression, but extracting the tar leads to a valid directory format. Shell postgres@4934dd659171:/$ pg_dump -F t -f /var/lib/postgresql/data/2023-04-01-tar.tar -U mypostgresqldumpplanet myjhipsterplanet The arguments are similar to those for the plain text SQL backup; only here, you need to add the parameter -F t in order to choose the tar backup format. 5. Restore Database If you followed the steps in this blog, you first need to set the changed customer record back to its original value. Connect to the database and change the record again. Shell myjhipsterplanet=# UPDATE customer SET customer_name = 'Assistent' WHERE id = 1; In order to restore the database, pg_restore is used. The official documentation of pg_restore can be found at the PostgreSQL website. Also, help information can be retrieved with the following command: Shell postgres@4934dd659171:/$ pg_restore --help 5.1 Plain Text SQL Restore As mentioned before, this will not work with pg_restore.
The following error will be shown: Shell postgres@4934dd659171:/$ pg_restore -U mypostgresqldumpplanet -d myjhipsterplanet -c /var/lib/postgresql/data/2023-04-01-plaintext.sql pg_restore: error: input file appears to be a text format dump. Please use psql. 5.2 Custom Restore The custom restore works like a charm: Shell postgres@4934dd659171:/$ pg_restore -U mypostgresqldumpplanet -d myjhipsterplanet -c /var/lib/postgresql/data/2023-04-01-custom.dump The parameters explained: -U mypostgresqldumpplanet: The name of the PostgreSQL admin user, often this will be postgres; -d myjhipsterplanet: The name of the database you want to restore; -c: In order to clean the database objects before restoring them; /var/lib/postgresql/data/2023-04-01-custom.dump: The name of the backup file. Connect to the database and show the customer with id 1. Shell postgres@4934dd659171:/$ psql -d postgres -U mypostgresqldumpplanet postgres=# \c myjhipsterplanet myjhipsterplanet=# select * from customer where id = 1; id | customer_name | company_id ----+---------------+------------ 1 | Assistent1 | (1 row) As you can see, the value is Assistent1, the value from the backup. 5.3 Directory Restore The directory restore can be done as follows: Shell postgres@4934dd659171:/$ pg_restore -U mypostgresqldumpplanet -d myjhipsterplanet -c /var/lib/postgresql/data/2023-04-01-directory/ The parameters are similar to the custom format. The pg_restore command will figure out for itself which format is used in the backup file. 5.4 Tar Restore And for completeness, the tar restore: Shell postgres@4934dd659171:/$ pg_restore -U mypostgresqldumpplanet -d myjhipsterplanet -c /var/lib/postgresql/data/2023-04-01-tar.tar 6. Where Is the Mounted Docker Volume? When you are running PostgreSQL as a Docker container, the directory /var/lib/postgresql/data will be mounted to a host directory. How can you find this directory? First, retrieve the name of the Docker container. Shell $ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 4934dd659171 postgres:14.5 "docker-entrypoint.s…" 3 hours ago Up 3 hours 127.0.0.1:5432->5432/tcp mypostgresqldumpplanet-mypostgresqldumpplanet-postgresql-1 The docker inspect command will dump the mounts of the container, and here you will find the mounted volumes. Shell $ docker inspect mypostgresqldumpplanet-mypostgresqldumpplanet-postgresql-1 ... "Mounts": [ { "Type": "bind", "Source": "/home/<user>/docker/volumes/postgresqldata", "Destination": "/var/lib/postgresql/data", "Mode": "rw", "RW": true, "Propagation": "rprivate" } ], ... 7. Clean Up If you followed the steps in this blog, you need to do some cleanup. Stop the PostgreSQL Docker container with the following command, executed from the root of the Git repository. Shell $ docker compose -f postgresql.yml down [+] Running 2/2 Container mypostgresqldumpplanet-mypostgresqldumpplanet-postgresql-1 Removed Network mypostgresqldumpplanet_default Removed 8. Conclusion In this post, you learned how to back up and restore a PostgreSQL database. Interesting information on how to optimize backups can be found here.
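One loose end from section 5.1: a plain text SQL dump is replayed with psql rather than pg_restore, as the error message suggests. A minimal sketch, assuming the file created in section 4.1 and the same user and database as above; since that dump was made without the clean or create options, you may want to drop and recreate the database first if its objects already exist.
Shell
# Replay the plain text SQL dump against the existing myjhipsterplanet database.
psql -U mypostgresqldumpplanet -d myjhipsterplanet -f /var/lib/postgresql/data/2023-04-01-plaintext.sql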
Real-time data is becoming increasingly important in today's fast-paced business world, as companies seek to gain valuable insights and make informed decisions based on the most up-to-date information available. However, processing and analyzing real-time data can be a challenge, particularly when it comes to joining multiple streams of data together in real time. In this article, we'll explore the concept of multi-stream joins in SQL and discuss some tips and techniques for performing these joins effectively using a streaming database. Multi-Stream Joins: What Are They? A multi-stream join involves combining two or more streams of data in real time to create a single output stream that reflects the current state of the data. This can be a powerful technique for analyzing real-time data from multiple sources, such as IoT devices, social media feeds, e-commerce apps, or financial markets. In SQL, joins are typically performed using a query that specifies the input streams, the join conditions, and any additional filtering or aggregation functions that are required. The exact syntax for these queries can vary depending on the database system being used, but the basic principles are the same (see the generic sketch at the end of this section). Imagine you work for a ride-sharing company like Uber that operates in multiple cities. You have a stream of data from your drivers' GPS devices that includes their location, speed, and other relevant information. You also have a stream of data from your customers' mobile apps that includes their location, destination, and other relevant details. To improve the overall customer experience and optimize driver efficiency, you want to join these two data streams together in real time to gain a better understanding of where your drivers are located, which customers are waiting for rides, and which routes are most efficient. We'll look at a couple of such scenarios below. Streaming Database for Multi-Stream Joins If you're looking to perform stream-to-stream joins in SQL, a streaming database helps you get the most out of your data. By using a streaming database, you can run SQL queries continuously on single streams and join two or more streams. Much like other popular RDBMSs (relational database management systems), a streaming database can join any two dataset/table expressions (including various sources or materialized views) into a single table expression. The main difference between joins in streaming databases and traditional databases is the nature of the data being processed. In a traditional database, data is typically stored in tables, and queries are run on this stored data at a point in time. In a streaming database, on the other hand, data is processed in real time as it is being generated, and queries run continuously on the data stream as it arrives in the form of topics from message brokers like Kafka. You can read more about how a streaming database differs from a traditional database. In the next section, I use RisingWave as a streaming database and provide some examples of how you could use SQL to perform a multi-stream join. You can find out more about how to choose the right streaming database. RisingWave uses Postgres-compatible SQL as the interface to manage and query data. This guide will walk you through some of the most used SQL commands in RisingWave.
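Before the RisingWave walkthrough, here is a generic, hedged sketch of the query shape described above, with the three parts labeled: input streams, a join condition, and optional filtering and aggregation. The stream names and columns (orders, payments, and so on) are hypothetical, and syntax details vary by streaming database.
SQL
-- Hypothetical streams: orders and payments, joined on order_id,
-- then filtered and aggregated into a single output stream.
SELECT o.order_id,
       o.customer_id,
       SUM(p.amount) AS total_paid
FROM orders AS o                       -- input stream 1
JOIN payments AS p                     -- input stream 2
  ON o.order_id = p.order_id           -- join condition
WHERE o.status = 'COMPLETED'           -- additional filtering
GROUP BY o.order_id, o.customer_id;    -- aggregation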
Join Streams With RisingWave Imagine you want to analyze your ride-sharing data. You might choose to join the driver data stream and the customer data stream based on the location field, as this would allow you to track which drivers are closest to which customers and ensure that you're dispatching drivers efficiently. The sample data below demonstrates the typical data streams generated by the ride-sharing app: Driver Data Stream driver_id location speed rating event_timestamp 101 San Francisco 60 4 2023-04-01 10:30:00 102 New York 50 5 2023-04-01 10:33:00 103 Los Angeles 45 1 2023-04-01 10:31:00 ... ... ... ... ... Customer Data Stream customer_id pickup_location destination event_timestamp 201 San Francisco Palo Alto 2023-04-01 10:30:00 202 New York Brooklyn 2023-04-01 10:33:00 203 Los Angeles Santa Monica 2023-04-01 10:31:00 ... ... ... ... Creating a Source for a Data Stream The first thing you do is connect the streaming database to a streaming source. A source is a resource from which RisingWave can read data. The streaming source can be two tables in your relational database (MySQL, PostgreSQL, or another), from which you can ingest data using Change Data Capture (CDC) and RisingWave's built-in connectors, or the source can be a Kafka broker. You can create a source in RisingWave using the CREATE SOURCE command. For example, the mapping of the drivers Kafka topic to a RisingWave source might look like this: SQL CREATE SOURCE driver_data ( driver_id BIGINT, location VARCHAR, speed BIGINT ) WITH ( connector = 'kafka', topic = 'driver_topic', properties.bootstrap.server = 'message_queue:29092', scan.startup.mode = 'earliest' ) ROW FORMAT JSON; And you will have a second source for the customer_topic too. Continuous Queries on a Stream Afterward, you can query streams with SQL as you would in an ordinary relational database, but in a streaming database the results are updated in real time as new data is added to the source. This simple equijoin query selects all fields from both data streams and joins them based on the location field. SQL SELECT driver_data.*, customer_data.* FROM driver_data JOIN customer_data ON driver_data.location = customer_data.pickup_location You might want to persist all rides-related data in the streaming database. You can create a new table rides in the database that contains information about each ride, including the driver ID, the customer ID, the pickup location, the drop-off location, and the fare amount. In this case, you want to join the incoming continuous driver data stream with the rides table based on the driver ID. The join query below allows you to combine information about each driver's location and rating with information about the rides they have completed to identify the most active drivers in certain geographic areas. SQL SELECT driver_data.driver_id, driver_data.location, driver_data.rating, COUNT(ride_data.ride_id) as total_rides FROM driver_data JOIN ride_data ON driver_data.driver_id = ride_data.driver_id WHERE driver_data.location = 'San Francisco' GROUP BY driver_data.driver_id, driver_data.location, driver_data.rating ORDER BY total_rides DESC Result: driver_id location rating total_rides 101 San Francisco 4 2 ... ... ... ... Window Joins in RisingWave Sometimes you are interested in the events that occur during certain time intervals. A window join is a type of join operation, commonly used in streaming databases, that allows you to join two streams of data based on a time window.
RisingWave offers two types of windows: Tumbling windows Hopping windows For example, you may want to calculate the average speed of drivers within a certain distance from a customer's pickup location, over a rolling window of the past 10 minutes. In this case, your SQL query might look something like this: SQL SELECT customer_data.*, AVG(driver_data.speed) AS avg_speed FROM customer_data JOIN driver_data ON ST_DISTANCE(driver_data.location, customer_data.pickup_location) < 5 GROUP BY TUMBLE(customer_data.event_time, INTERVAL '10' MINUTE), customer_data.customer_id Result: customer_id pickup_location destination event_time avg_speed 201 San Francisco Palo Alto 2023-04-01 10:30:00 60.0 203 Los Angeles Santa Monica 2023-04-01 10:31:00 45.0 202 New York Brooklyn 2023-04-01 10:33:00 50.0 This query selects all fields from the customer data stream and calculates the average speed of drivers within 5 km of the customer's pickup location. In this query, the TUMBLE() function is used to group the data into tumbling time windows of 10 minutes. The GROUP BY clause aggregates the data within each time window and for each customer separately. Write Merged Streams to a Materialized View With the RisingWave streaming database, you can also create materialized views for joined streams. A materialized view is a precomputed snapshot of data that is stored as a table in the streaming database. Materialized views can be particularly useful because they allow you to combine and aggregate data from multiple streams into a single table: the streaming database computes the query results on the fly and updates the view as new data arrives. This can simplify complex queries, improve overall system performance and responsiveness, and provide a more comprehensive view of the data that's easier to work with. In RisingWave, you use the CREATE MATERIALIZED VIEW statement to create a materialized view. Here's an example of a materialized view that can be created by merging the driver and ride streams in the ride-sharing data example above. SQL CREATE MATERIALIZED VIEW most_active_drivers AS SELECT drivers.driver_id, drivers.location, drivers.rating, COUNT(rides.ride_id) as total_rides FROM drivers JOIN rides ON drivers.driver_id = rides.driver_id WHERE drivers.location = 'San Francisco' GROUP BY drivers.driver_id, drivers.location, drivers.rating ORDER BY total_rides DESC The materialized view result: driver_id location rating total_rides 101 San Francisco 4 2 104 San Francisco 3 1 Takeaways With a streaming database, you can join two or more streams by ingesting them from different data sources. You can join tables by table reference, type, and table functions like JOIN. It is also possible to join multiple streams based on a time window using window join functions like Tumble or Hop. The resulting stream contains the combined data from all streams, which means this operation performs expensive calculations. In that case, you can create a materialized view to speed up query performance. Related Resources Shared Indexes and Joins in Streaming Databases Query Real-Time Data in Kafka Using SQL
Building a cluster of single-board mini-computers is an excellent way to explore and learn about distributed computing. With the scarcity of Raspberry Pi boards, and the prices starting to get prohibitive for some projects, alternatives such as Orange Pi have gained popularity. In this article, I’ll show you how to build a (surprisingly cheap) 4-node cluster packed with 16 cores and 4GB RAM to deploy a MariaDB replicated topology that includes three database servers and a database proxy, all running on a Docker Swarm cluster and automated with Ansible. This article was inspired by a member of the audience who asked my opinion about Orange Pi during a talk I gave in Colombia. I hope this completes the answer I gave you. What Is a Cluster? A cluster is a group of computers that work together to achieve a common goal. In the context of distributed computing, a cluster typically refers to a group of computers that are connected to each other and work together to perform computation tasks. Building a cluster allows you to harness the power of multiple computers to solve problems that a single computer cannot handle. For example, a database can be replicated in multiple nodes to achieve high availability—if one node fails, other nodes can take over. It can also be used to implement read/write splitting to make one node handle writes, and another reads in order to achieve horizontal scalability. What Is Orange Pi Zero2? The Orange Pi Zero2 is a small single-board computer that runs on the ARM Cortex-A53 quad-core processor. It has 512MB or 1GB of DDR3 RAM, 100Mbps Ethernet, Wi-Fi, and Bluetooth connectivity. The Orange Pi Zero2 is an excellent choice for building a cluster due to its low cost, small size, and good performance. The only downside I found was that the Wi-Fi connection didn’t seem to perform as well as with other single-board computers. From time to time, the boards disconnect from the network, so I had to place them close to a Wi-Fi repeater. This could be a problem with my setup or with the boards. I’m not entirely sure. Having said that, this is not a production environment, so it worked pretty well for my purposes. What You Need Here are the ingredients: Orange Pi Zero2: I recommend the 1GB RAM variant and try to get at least 4 of them. I recently bought 4 of them for €30 each. Not bad at all! Give it a try! MicroSD cards: One per board. Try to use fast ones — it will make quite a difference in performance! I recommend at least 16GB. For reference, I used SanDisk Extreme Pro Micro/SDXC with 32GB, which offers a write speed of 90 MB/s and reads at 170 MB/s. A USB power hub: To power the devices, I recommend a dedicated USB power supply. You could also just use individual chargers, but the setup will be messier and require a power strip with as many outlets as devices as you have. It’s better to use a USB multi-port power supply. I used an Anker PowerPort 6, but there are also good and cheaper alternatives. You’ll have to Google this too. Check that each port can supply 5V and at least 2.4A. USB cables: Each board needs to be powered via a USB-C port. You need a cable with one end of type USB-C and the other of the type your power hub accepts. Bolts and nuts: To stack up the boards. Heat sinks (optional): These boards can get hot. I recommend getting heat sinks to help with heat dissipation. 
Materials needed for building an Orange Pi Zero2 cluster Assembling the Cluster One of the fun parts of building this cluster is the physical assembly of the boards on a case or some kind of structure that makes them look like a single manageable unit. Since my objective here is to keep the budget as low as possible, I used cheap bolts and nuts to stack the boards one on top of the other. I didn't find any ready-to-use cluster cases for the Orange Pi Zero2. One alternative is to 3D-print your own case. When stacking the boards together, keep an eye on the antenna placement. Avoid crushing the cable, especially if you installed heat sinks. An assembled Orange Pi Zero2 cluster with 4 nodes Installing the Operating System The second step is to install the operating system on each microSD card. I used Armbian bullseye legacy 4.9.318. Download the file and use a tool like balenaEtcher to make bootable microSD cards. Download and install this tool on your computer. Select the Armbian image file and the drive that corresponds to the microSD card. Flash the image and repeat the process for each microSD card. Configuring the Orange Pi Wi-Fi Connection (Headless) To configure the Wi-Fi connection, Armbian includes the /boot/armbian_first_run.txt.template file, which allows you to configure the operating system when it runs for the first time. The template includes instructions, so it's worth checking. You have to rename this file to armbian_first_run.txt. Here's what I used: Plain Text FR_general_delete_this_file_after_completion=1 FR_net_change_defaults=1 FR_net_ethernet_enabled=0 FR_net_wifi_enabled=1 FR_net_wifi_ssid='my_connection_id' FR_net_wifi_key='my_password' FR_net_wifi_countrycode='FI' FR_net_use_static=1 FR_net_static_gateway='192.168.1.1' FR_net_static_mask='255.255.255.0' FR_net_static_dns='192.168.1.1 8.8.8.8' FR_net_static_ip='192.168.1.181' Use your own Wi-Fi details, including connection name, password, country code, gateway, mask, and DNS. I wasn't able to read the SD card from macOS. I had to use another laptop with Linux on it to make the changes to the configuration file on each SD card. To mount the SD card on Linux, run the following command before and after inserting the SD card and see what changes: Shell sudo fdisk -l I created a Bash script to automate the process. The script accepts the IP address to set as a parameter. For example: Shell sudo ./armbian-setup.sh 192.168.1.181 I ran this command on each of the four SD cards, changing the IP address from 192.168.1.181 to 192.168.1.184. Connecting Through SSH Insert the flashed and configured microSD cards into each board and turn the power supply on. Be patient! Give the small devices time to boot. It can take several minutes the first time you boot them. An Orange Pi cluster running Armbian Use the ping command to check whether the devices are ready and connected to the network: Shell ping 192.168.1.181 Once they respond, connect to the mini-computers through SSH using the root user and the IP address that you configured. For example: Shell ssh root@192.168.1.181 The default password is: Plain Text 1234 You'll be presented with a wizard-like tool to complete the installation. Follow the steps to finish the configuration and repeat the process for each board. Installing Ansible Imagine you want to update the operating system on each machine. You'll have to log into a machine, run the update command, and end the remote session. Then repeat for each machine in the cluster. A tedious job, even if you have only 4 nodes.
Ansible is an automation tool that allows you to run a command on multiple machines using a single call. You can also create a playbook, a file that contains commands to be executed in a set of machines defined in an inventory. Install Ansible on your working computer and generate a configuration file: Shell sudo su ansible-config init --disabled -t all > /etc/ansible/ansible.cfg exit In the /etc/ansible/ansible.cfg file, set the following properties (enable them by removing the semicolon): Plain Text host_key_checking=False become_allow_same_user=True ask_pass=True This will make the whole process easier. Never do this in a production environment! You also need an inventory. Edit the /etc/ansible/hosts file and add the Orange Pi nodes as follows: Plain Text ############################################################################## # 4-node Orange Pi Zero 2 cluster ############################################################################## [opiesz] 192.168.1.181 ansible_user=orangepi hostname=opiz01 192.168.1.182 ansible_user=orangepi hostname=opiz02 192.168.1.183 ansible_user=orangepi hostname=opiz03 192.168.1.184 ansible_user=orangepi hostname=opiz04 [opiesz_manager] opiz01.local ansible_user=orangepi [opiesz_workers] opiz[02:04].local ansible_user=orangepi In the ansible_user variable, specify the username that you created during the installation of Armbian. Also, change the IP addresses if you used something different. Setting up a Cluster With Ansible Playbooks A key feature of a computer cluster is that the nodes should be somehow logically interconnected. Docker Swarm is a container orchestration tool that will convert your arrangement of Orange Pi devices into a real cluster. You can later deploy any kind of server software. Docker Swarm will automatically pick one of the machines to host the software. To make the process easier, I have created a set of Ansible playbooks to further configure the boards, update the packages, reboot or power off the machines, install Docker, set up Docker Swarm, and even install a MariaDB database with replication and a database cluster. Clone or download this GitHub repository: Shell git clone https://github.com/alejandro-du/orange-pi-zero-cluster-ansible-playbooks.git Let’s start by upgrading the Linux packages on all the boards: Shell ansible-playbook upgrade.yml --ask-become-pass Now configure the nodes to have an easy-to-remember hostname with the help of Avahi, and configure the LED activity (red LED activates on SD card activity): Shell ansible-playbook configure-hosts.yml --ask-become-pass Reboot all the boards: Shell ansible-playbook reboot.yml --ask-become-pass Install Docker: Shell ansible-playbook docker.yml --ask-become-pass Set up Docker Swarm: Shell ansible-playbook docker-swarm.yml --ask-become-pass Done! You have an Orange Pi cluster ready for fun! Deploying MariaDB on Docker Swarm I have to warn you here. I don’t recommend running a database on container orchestration software. That’s Docker Swarm, Kubernetes, and others. Unless you are willing to put a lot of effort into it. This article is a lab. A learning exercise. Don’t do this in production! Now let’s get back to the fun… Run the following to deploy one MariaDB primary server, two MariaDB replica servers, and one MaxScale proxy: Shell ansible-playbook mariadb-stack.yml --ask-become-pass The first time you do this, it will take some time. Be patient. 
SSH into the manager node: Shell ssh orangepi@opiz01.local Inspect the nodes in the Docker Swarm cluster: Shell docker node ls Inspect the MariaDB stack: Shell docker stack ps mariadb A cooler way to inspect the containers in the cluster is by using the Docker Swarm Visualizer. Deploy it as follows: Shell docker service create --name=viz --publish=9000:8080 --constraint=node.role==manager --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock alexellis2/visualizer-arm:latest On your working computer, open a web browser and go to the visualizer's published port on the manager node, for example, http://opiz01.local:9000 (port 9000, as published above). You should see all the nodes in the cluster and the deployed containers. Docker Swarm Visualizer showing MariaDB deployed MaxScale is an intelligent database proxy with tons of features. For now, let's see how to connect to the MariaDB cluster through this proxy. Use a tool like DBeaver, DbGate, or even a database extension for your favorite IDE. Create a new database connection using the following connection details: Host: opiz01.local Port: 4000 Username: user Password: password Create a new table: MariaDB SQL USE demo; CREATE TABLE messages( id INT PRIMARY KEY AUTO_INCREMENT, content TEXT NOT NULL ); Insert some data: MariaDB SQL INSERT INTO messages(content) VALUES ("It works!"), ("Hello, MariaDB"), ("Hello, Orange Pi"); When you execute this command, MaxScale sends it to the primary server. Now read the data: MariaDB SQL SELECT * FROM messages; When you execute this command, MaxScale sends it to one of the replicas. This division of reads and writes is called read-write splitting. The MaxScale UI showing a MariaDB cluster with replication and read-write splitting You can also access the MaxScale UI. Use the following credentials: Username: admin Password: mariadb Watch the MaxScale videos if you want to learn more about its features. You won't regret it!
I have developed a small-scale application that concentrates on a straightforward business scenario of customer reviews to illustrate various use cases. This application is implemented using the Spring Boot framework and communicates with a MySQL database via Spring Data JPA, with the code written in Kotlin. It exposes a simple REST API featuring CRUD operations on reviews. Spoiler alert: The use cases illustrated in this article are intentionally simplistic, intended solely to showcase the integration with ShardingSphere functionalities; the problems discussed here can be solved in various, and maybe even better, ways, so don't spend too much time thinking about the "why." So, without further ado, let's dive into code. Here is the main entity: Kotlin @Entity @Table(name = "reviews") data class Review( var text: String, var author: String, @Column(name = "author_telephone") var authorTelephone: String? = null, @Column(name = "author_email") var authorEmail: String? = null, @Column(name = "invoice_code") var invoiceCode: String? = null, @Column(name = "course_id") var courseId: Int? = null ) : AbstractEntity() And we have the following REST API: Kotlin @RestController @RequestMapping("/api/v1/reviews") class ReviewController(val reviewService: ReviewService) { @GetMapping("/filter", params = ["author"]) fun getReviewsByAuthor(@RequestParam("author") author: String): ResponseEntity<List<Review>> { val reviews = reviewService.findAllByAuthor(author) return ResponseEntity.ok(reviews) } @GetMapping("/filter", params = ["courseId"]) fun getReviewsByCourseId(@RequestParam("courseId") courseId: Int): ResponseEntity<List<Review>> { val reviews = reviewService.findAllByCourseId(courseId) return ResponseEntity.ok(reviews) } @GetMapping("/{id}") fun getReviewById(@PathVariable("id") id: Int): ResponseEntity<Review> { val review = reviewService.findById(id) return ResponseEntity.ok(review) } @PostMapping fun createReview(@RequestBody review: Review): ResponseEntity<Review> { val savedReview = reviewService.save(review) return ResponseEntity.status(HttpStatus.CREATED).body(savedReview) } @DeleteMapping("/{id}") fun deleteReviewById(@PathVariable("id") id: Int): ResponseEntity<Unit> { reviewService.deleteById(id) return ResponseEntity.noContent().build() } } Here is the MySQL container from docker-compose.yml: YAML mysql-master: image: 'bitnami/mysql:latest' ports: - '3306:3306' volumes: - 'mysql_master_data:/bitnami/mysql/data' - ./init.sql:/docker-entrypoint-initdb.d/init.sql environment: - MYSQL_USER=my_user - MYSQL_PASSWORD=my_password - MYSQL_DATABASE=reviews-db Note: The init.sql contains DDL for the reviews table. Now let's execute some HTTP requests, such as creating a review (all the mentioned requests are in requests.http).
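One quick aside before running them: the controller above delegates to a ReviewService that the article doesn't list (its real implementation lives in the linked repository). A minimal sketch of what it might look like with Spring Data JPA could be the following; the method names come from the controller, while the repository interface, the @Service class, and the Int id type are assumptions:
Kotlin
import org.springframework.data.jpa.repository.JpaRepository
import org.springframework.stereotype.Service

interface ReviewRepository : JpaRepository<Review, Int> {
    fun findAllByAuthor(author: String): List<Review>
    fun findAllByCourseId(courseId: Int): List<Review>
}

@Service
class ReviewService(private val repository: ReviewRepository) {
    fun findAllByAuthor(author: String): List<Review> = repository.findAllByAuthor(author)
    fun findAllByCourseId(courseId: Int): List<Review> = repository.findAllByCourseId(courseId)
    fun findById(id: Int): Review =
        repository.findById(id).orElseThrow { NoSuchElementException("Review $id not found") }
    fun save(review: Review): Review = repository.save(review)
    fun deleteById(id: Int) = repository.deleteById(id)
}
With the service layer in mind, here is the create-review request: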
Plain Text POST http://localhost:8070/api/v1/reviews/ Content-Type: application/json { "text": "This is a great course!", "author": "John Doe", "authorTelephone": "555-1234", "authorEmail": "johndoe@example.com", "invoiceCode": "ABC123", "courseId": 123 } We'll observe the following query (p6spy enabled): Plain Text INFO 16784 --- [nio-8070-exec-1] p6spy : #1681730984533 | took 4ms | statement | connection 2| url jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false insert into reviews (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values ('2023-04-17T14:29:44.450+0300', '2023-04-17T14:29:44.450+0300', 'John Doe', 'johndoe@example.com', '555-1234', 123, 'ABC123', 'This is a great course!', 2); Data Sharding The reviews application discussed above involves a database storing a large number of reviews for courses. As the number of courses and reviews grows, the reviews table in the database can become very large, making it difficult to manage and slowing down performance. To address this issue, we can implement data sharding, which involves breaking up the reviews table into smaller, more manageable pieces called shards. Each shard contains a subset of the data in the reviews table, with each shard being responsible for a specific range of data based on a shard key, such as the course id. Table sharding can help the reviews application manage its growing reviews table more effectively, improving performance and scalability while also making backups and maintenance tasks easier to manage. But, there is always a but – implementing table sharding manually can be a complex and challenging task. It requires a deep understanding of database design and architecture, as well as knowledge of the specific sharding implementation being used. There are a lot of challenges that can arise when implementing table sharding in a Spring Boot application. It is time we met ShardingSphere: Apache ShardingSphere is an ecosystem to transform any database into a distributed database system, and enhance it with sharding, elastic scaling, encryption features and more. Apache ShardingSphere comes in two flavors: ShardingSphere-JDBC is a lightweight Java framework that provides additional services at Java's JDBC layer. ShardingSphere-Proxy is a transparent database proxy, providing a database server that encapsulates database binary protocol to support heterogeneous languages. ShardingSphere provides a range of features, including data sharding, distributed transaction, read/write splitting, high availability, data migration, query federation, data encryption, and shadow database for full-link online load testing scenarios, to help manage large volumes of data and ensure data integrity and security. For now, we'll focus on ShardingSphere-JDBC's data sharding, and we'll use the following dependencies: Kotlin implementation("org.apache.shardingsphere:shardingsphere-jdbc-core:5.3.2") implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-core:5.3.2") implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-repository-zookeeper:5.3.2") implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-repository-api:5.3.2") Note: The ShardingSphere team had starters for spring boot in 5.1.x versions, but they moved away from starters in favor of consistency in their project and now are recommending using the latest version (non-starter), which can be configured a bit differently, but still fairly simple. 
In my repo, through the commits, you can find examples of spring boot starter configuration too. ShardingSphere-JDBC can be configured mainly in two ways: YAML configuration and Java configuration. I picked the YAML configuration for this article. So at the moment, my datasource configuration from application.yaml looks like this: YAML spring: datasource: username: my_user password: my_password url: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false tomcat: validation-query: "SELECT 1" test-while-idle: true jpa: properties: hibernate: dialect: org.hibernate.dialect.MySQL8Dialect open-in-view: false hibernate: ddl-auto: none To enable ShardingSphere JDBC, we'll have to make it look like this: YAML spring: datasource: driver-class-name: org.apache.shardingsphere.driver.ShardingSphereDriver url: jdbc:shardingsphere:classpath:sharding.yaml jpa: properties: hibernate: dialect: org.hibernate.dialect.MySQL8Dialect open-in-view: false hibernate: ddl-auto: none We specified that the driver being used for the data source will be the ShardingSphereDriver and the url should be picked based on this file sharding.yaml Okay, pretty simple, right? Let's continue; let's create the sharding.yaml YAML dataSources: master: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 mode: type: Standalone repository: type: JDBC rules: - !SHARDING tables: reviews: actualDataNodes: master.reviews_$->{0..1} tableStrategy: standard: shardingColumn: course_id shardingAlgorithmName: inline shardingAlgorithms: inline: type: INLINE props: algorithm-expression: reviews_$->{course_id % 2} allow-range-query-with-inline-sharding: true props: proxy-hint-enabled: true sql-show: true Now let's analyze the most important properties: dataSources.master – here lies the definition of our master data source. mode – which can be either standalone with JDBC type or cluster with Zookeeper type (recommended for production), which is used for configuration information persistence rules– here, we can enable various ShardingSphere features like - !SHARDING tables.reviews – here, we describe the actual tables based on the inline syntax rules, meaning that we'll have two tables reviews_0 and reviews_1 sharded by the course_id column. shardingAlgorithms – here, we describe the manual inline sharding algorithm via a groovy expression telling that the reviews table is divided into two tables based on the course_id column. props – here, we enabled intercepting/formatting sql queries (p6spy can be disabled/commented). Important: Before starting our application, we need to make sure that our defined shards are created, so I created two tables in my database: reviews_0 and reviews_1 (init.sql). 
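To make the inline expression concrete: reviews_$->{course_id % 2} is just a Groovy expression that maps the shard key to a physical table name. Purely as an illustration (ShardingSphere evaluates the expression itself; this helper is not part of the project), the routing decision boils down to:
Kotlin
// Mirrors what the INLINE sharding algorithm's Groovy expression computes.
fun physicalTable(courseId: Int): String = "reviews_${courseId % 2}"

fun main() {
    println(physicalTable(123)) // reviews_1
    println(physicalTable(124)) // reviews_0
}
Any row with an odd course_id lands in reviews_1 and any even course_id lands in reviews_0, which is exactly what we'll observe in the logs next.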
Now we are ready to start our application and do some requests: Plain Text ### POST a new review POST http://localhost:8070/api/v1/reviews/ Content-Type: application/json { "text": "This is a great course!", "author": "John Doe", "authorTelephone": "555-1234", "authorEmail": "johndoe@example.com", "invoiceCode": "ABC123", "courseId": 123 } We can see the following log: Plain Text INFO 35412 --- [nio-8070-exec-2] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_1 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:42:01.8069745, 2023-04-17 15:42:01.8069745, John Doe, johndoe@example.com, 555-1234, 123, ABC123, This is a great course!, 4] If we are to execute one more request with a different payload: Plain Text INFO 35412 --- [nio-8070-exec-8] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_1 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:43:47.3267788, 2023-04-17 15:43:47.3267788, Mike Scott, mikescott@example.com, 555-1234, 123, ABC123, This is an amazing course!, 5] We can notice that both reviews, placed by John and Mike, went into the reviews_1 table. What if we change the course_id to 124 and execute the same POST request again? Plain Text INFO 35412 --- [nio-8070-exec-4] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:44:42.7133688, 2023-04-17 15:44:42.7133688, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 6] We can see that our new review got saved in the reviews_0 table. Now we can execute two GET requests based on the course_id: Plain Text ### GET reviews by course ID GET http://localhost:8070/api/v1/reviews/filter?courseId=123 GET http://localhost:8070/api/v1/reviews/filter?courseId=124 And observe in the logs how routing between our two tables took place. Plain Text INFO 35412 --- [nio-8070-exec-9] ShardingSphere-SQL: Actual SQL: master ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_1 review0_ where review0_.course_id=? ::: [123] INFO 35412 --- [nio-8070-exec-5] ShardingSphere-SQL: Actual SQL: master ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124] The first select was directed to the reviews_1 table and the second one to reviews_0 - sharding in action! Read-Write Splitting Now let's imagine another problem: the reviews application may experience heavy load during peak hours, leading to slow response times and a degraded user experience. To address this issue, we can implement read/write splitting to balance the load and improve performance.
And how lucky we are that ShardingSphere offers us a read/write splitting solution. Read-write splitting involves directing read queries to replica databases and write queries to a master database, ensuring that read requests do not interfere with write requests and that database performance is optimized. Before configuring the read-write splitting solution, we'll have to make some changes to our docker-compose in order to have some replicas for our master db (credits to bitnami for providing this): YAML mysql-master: image: 'bitnami/mysql:latest' ports: - '3306:3306' volumes: - 'mysql_master_data:/bitnami/mysql/data' - ./init.sql:/docker-entrypoint-initdb.d/init.sql environment: - MYSQL_REPLICATION_MODE=master - MYSQL_REPLICATION_USER=repl_user - MYSQL_REPLICATION_PASSWORD=repl_password - MYSQL_ROOT_PASSWORD=master_root_password - MYSQL_USER=my_user - MYSQL_PASSWORD=my_password - MYSQL_DATABASE=reviews-db mysql-slave: image: 'bitnami/mysql:latest' ports: - '3306' depends_on: - mysql-master environment: - MYSQL_USER=my_user - MYSQL_PASSWORD=my_password - MYSQL_REPLICATION_MODE=slave - MYSQL_REPLICATION_USER=repl_user - MYSQL_REPLICATION_PASSWORD=repl_password - MYSQL_MASTER_HOST=mysql-master - MYSQL_MASTER_PORT_NUMBER=3306 - MYSQL_MASTER_ROOT_PASSWORD=master_root_password Let's start our containers like this (one master and two replicas): docker-compose up --detach --scale mysql-master=1 --scale mysql-slave=2 Now we need the mapped ports for our slaves. $ docker port infra-mysql-slave-1 3306/tcp -> 0.0.0.0:49923 $ docker port infra-mysql-slave-2 3306/tcp -> 0.0.0.0:49922 Okay, now that we have our master and its replicas in place, we are ready to configure two new dataSources like this: YAML dataSources: master: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 slave0: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:49922/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 slave1: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:49923/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 And then, we can add the read-write splitting rule to the rules: YAML - !READWRITE_SPLITTING dataSources: readwrite_ds: staticStrategy: writeDataSourceName: master readDataSourceNames: - slave0 - slave1 loadBalancerName: readwrite-load-balancer loadBalancers: readwrite-load-balancer: type: ROUND_ROBIN Here I think everything is self-explanatory: we have specified the written data source name to be the master and the read data sources to point to our slaves: slave0 and slave1; and we picked a round-robin load balancer algorithm. 
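For intuition about the ROUND_ROBIN load balancer we just picked: read queries are simply handed to the configured read data sources in rotation. A toy Kotlin sketch of that behavior (ShardingSphere ships its own implementations, along with other strategies such as random selection, so treat this as an illustration only):
Kotlin
import java.util.concurrent.atomic.AtomicLong

class RoundRobinLoadBalancer(private val readDataSources: List<String>) {
    private val counter = AtomicLong()

    // Each read query goes to the next replica in rotation.
    fun nextReadDataSource(): String {
        val index = (counter.getAndIncrement() % readDataSources.size).toInt()
        return readDataSources[index]
    }
}

fun main() {
    val loadBalancer = RoundRobinLoadBalancer(listOf("slave0", "slave1"))
    repeat(4) { println(loadBalancer.nextReadDataSource()) } // slave0, slave1, slave0, slave1
}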
Important: One last change to be made is regarding the sharding rule, which knows nothing about the newly configured read-write splitting rule and points directly to the master: YAML - !SHARDING tables: reviews: actualDataNodes: readwrite_ds.reviews_$->{0..1} Now our sharding rule will be wrapped by the read-write splitting rule, too, and the data source decision will be made before picking the correct table (pay attention to readwrite_ds.reviews_$->{0..1}). Okay, we can start our application, run the same POST request, and observe the logs: Plain Text INFO 22860 --- [nio-8070-exec-1] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 16:12:07.25473, 2023-04-17 16:12:07.25473, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 7] Nothing surprising here: the sharding still works, and the query took place in the master data source (the write data source). But if we are to run a couple of GET requests, we'll observe the following: Plain Text INFO 22860 --- [nio-8070-exec-2] ShardingSphere-SQL: Actual SQL: slave0 ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124] INFO 22860 --- [nio-8070-exec-4] ShardingSphere-SQL: Actual SQL: slave1 ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124] You can observe read-write splitting in action: our write queries go to the master data source, while our read queries go to the master's replicas (slave0 and slave1), all while maintaining the correct sharding rule. Data Masking Off to another imaginary problem regarding our application. Imagine that sensitive information such as customer email, phone number, and invoice code may need to be accessed by certain users or applications while remaining hidden from others due to data privacy regulations. To address this issue, we could implement a data masking solution to mask sensitive data when mapping the result or at the SQL level. But as you guessed, why bother? ShardingSphere is here to save the day with another easy-to-enable feature – data masking.
So let's update our configuration with this new rule: YAML - !MASK tables: reviews: columns: invoice_code: maskAlgorithm: md5_mask author_email: maskAlgorithm: mask_before_special_chars_mask author_telephone: maskAlgorithm: keep_first_n_last_m_mask maskAlgorithms: md5_mask: type: MD5 mask_before_special_chars_mask: type: MASK_BEFORE_SPECIAL_CHARS props: special-chars: '@' replace-char: '*' keep_first_n_last_m_mask: type: KEEP_FIRST_N_LAST_M props: first-n: 3 last-m: 2 replace-char: '*' Let's see what we have here: tables.reviews – we defined a masking algorithm for each of the three columns mentioned before maskAlgorithms.md5_mask – we specified the MD5 algorithm type for invoice_code maskAlgorithms.mask_before_special_chars_mask – we configured the MASK_BEFORE_SPECIAL_CHARS algorithm for the author_email column, meaning that all the characters before the @ symbol will be replaced with the * symbol. maskAlgorithms.keep_first_n_last_m_mask – we configured the KEEP_FIRST_N_LAST_M algorithm for the author_telephone column, meaning that only the first 3 and last 2 characters of a telephone number will stay unchanged; everything in between will be masked by the * symbol. Note: You can find a lot of other masking algorithms here. All right, let's start our application and do the same POST request. Plain Text INFO 35296 --- [nio-8070-exec-1] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 16:26:51.8188306, 2023-04-17 16:26:51.8188306, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 9] We'll see nothing new here; the exact values that we provided in the body are the ones stored in the database. You can check the master and replica databases to confirm that. But the magic shows up when we execute our GET request, which gives us the following body. JSON [ { "text": "This is an amazing course!", "author": "Mike Scott", "authorTelephone": "555***34", "authorEmail": "*********@example.com", "invoiceCode": "bbf2dead374654cbb32a917afd236656", "courseId": 124, "id": 9, "lastModifiedAt": "2023-04-17T15:44:43" } ] As you can see, the data stays unchanged in the database, but when queried and delivered, the telephone, email, and invoice code are masked according to the algorithms defined in the data masking rule (a plain-Kotlin approximation of these algorithms is sketched at the end of this article). Conclusion That's it for today, folks! I hope this article gave you a good understanding of how ShardingSphere can be used for database sharding, read-write splitting, and data masking with Spring Boot. We covered the basics of ShardingSphere, how to configure it for sharding, how to use read-write splitting, and how to apply data masking to sensitive data. Make sure to check out all the features that ShardingSphere provides and its amazing documentation here. All the code mentioned in this article can be found here. Happy coding!
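As promised above, here is a plain-Kotlin approximation of what those three masking algorithms do to a value. It is only meant to make the configuration tangible; ShardingSphere applies its own implementations when the result set is read, and the exact edge-case behavior may differ:
Kotlin
import java.security.MessageDigest

// MD5: the whole value is replaced by its hex-encoded MD5 digest.
fun md5Mask(value: String): String =
    MessageDigest.getInstance("MD5").digest(value.toByteArray())
        .joinToString("") { "%02x".format(it) }

// MASK_BEFORE_SPECIAL_CHARS: every character before the special character is replaced.
fun maskBeforeSpecialChars(value: String, specialChars: String = "@", replaceChar: Char = '*'): String {
    val index = value.indexOf(specialChars)
    return if (index <= 0) value else replaceChar.toString().repeat(index) + value.substring(index)
}

// KEEP_FIRST_N_LAST_M: only the first n and last m characters survive unmasked.
fun keepFirstNLastM(value: String, firstN: Int = 3, lastM: Int = 2, replaceChar: Char = '*'): String {
    if (value.length <= firstN + lastM) return value
    return value.take(firstN) +
        replaceChar.toString().repeat(value.length - firstN - lastM) +
        value.takeLast(lastM)
}

fun main() {
    println(maskBeforeSpecialChars("mikescott@example.com")) // *********@example.com
    println(keepFirstNLastM("555-1234"))                     // 555***34
    println(md5Mask("ABC123"))                               // 32-character hex digest
}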
Developing scalable and reliable applications is a labor of love. A cloud-native system might require unit tests, integration tests, build tests, and a full pipeline for building and deploying applications at the click of a button. A number of intermediary steps might be required to ship a robust product. As distributed and containerized applications have flooded the market, so too have container orchestration tools like Kubernetes. Kubernetes allows us to build distributed applications across a cluster of nodes, with fault tolerance, self-healing, and load balancing, plus many other features. Let's explore some of these tools by building a distributed to-do list application in Node.js, backed by the YugabyteDB distributed SQL database. Getting Started A production deployment will likely involve setting up a full CI/CD pipeline to push containerized builds to the Google Container Registry to run on Google Kubernetes Engine or similar cloud services. For demonstration purposes, let's focus on running a similar stack locally. We'll develop a simple Node.js server, which is built as a Docker image to run on Kubernetes on our machines. We'll use this Node.js server to connect to a YugabyteDB distributed SQL cluster and return records from a REST endpoint. Installing Dependencies We begin by installing some dependencies for building and running our application. Docker Desktop Docker is used to build container images, which we'll host locally. Minikube Minikube creates a local Kubernetes cluster for running our distributed application. YugabyteDB Managed Next, we create a YugabyteDB Managed account and spin up a cluster in the cloud. YugabyteDB is PostgreSQL-compatible, so you can also run a PostgreSQL database elsewhere or run YugabyteDB locally if desired. For high availability, I've created a 3-node database cluster running on AWS, but for demonstration purposes, a free single-node cluster works fine. Seeding Our Database Once our database is up and running in the cloud, it's time to create some tables and records. YugabyteDB Managed has a cloud shell that can be used to connect via the web browser, but I've chosen to use the YugabyteDB client shell on my local machine. Before connecting, we need to download the root certificate from the cloud console. I've created a SQL script to create a todos table and some records. SQL CREATE TYPE todo_status AS ENUM ('complete', 'in-progress', 'incomplete'); CREATE TABLE todos ( id serial PRIMARY KEY, description varchar(255), status todo_status ); INSERT INTO todos (description, status) VALUES ( 'Learn how to connect services with Kuberenetes', 'incomplete' ), ( 'Build container images with Docker', 'incomplete' ), ( 'Provision multi-region distributed SQL database', 'incomplete' ); We can use this script to seed our database. Shell > ./ysqlsh "user=admin \ host=<DATABASE_HOST> \ sslmode=verify-full \ sslrootcert=$PWD/root.crt" -f db.sql With our database seeded, we're ready to connect to it via Node.js. Build a Node.js Server It's simple to connect to our database with the node-postgres driver. YugabyteDB has built on top of this library with the YugabyteDB Node.js Smart Driver, which comes with additional features that unlock the powers of distributed SQL, including load-balancing and topology awareness.
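A quick aside on that PostgreSQL compatibility: because YugabyteDB speaks the PostgreSQL wire protocol, the todos table we just seeded can be read by any PostgreSQL client, not only Node.js. For example, a minimal Kotlin/JDBC sketch (assumptions: the org.postgresql JDBC driver is on the classpath, and we reuse the same host, admin credentials, and root certificate from above) might look like this:
Kotlin
import java.sql.DriverManager
import java.util.Properties

fun main() {
    val props = Properties().apply {
        setProperty("user", "admin")
        setProperty("password", "<DATABASE_PASSWORD>")
        setProperty("sslmode", "verify-full")
        setProperty("sslrootcert", "root.crt")
    }
    // YSQL, YugabyteDB's SQL API, listens on port 5433 and is PostgreSQL-compatible.
    DriverManager.getConnection("jdbc:postgresql://<DATABASE_HOST>:5433/yugabyte", props).use { conn ->
        conn.createStatement().executeQuery("SELECT id, description, status FROM todos").use { rs ->
            while (rs.next()) {
                println("${rs.getInt("id")} | ${rs.getString("description")} | ${rs.getString("status")}")
            }
        }
    }
}
Back in the Node.js project, install the web framework and the smart driver: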
Shell > npm install express > npm install @yugabytedb/pg JavaScript const express = require("express"); const App = express(); const { Pool } = require("@yugabytedb/pg"); const fs = require("fs"); let config = { user: "admin", host: "<DATABASE_HOST>", password: "<DATABASE_PASSWORD>", port: 5433, database: "yugabyte", min: 5, max: 10, idleTimeoutMillis: 5000, connectionTimeoutMillis: 5000, ssl: { rejectUnauthorized: true, ca: fs.readFileSync("./root.crt").toString(), servername: "<DATABASE_HOST>", }, }; const pool = new Pool(config); App.get("/todos", async (req, res) => { try { const data = await pool.query("select * from todos"); res.json({ status: "OK", data: data?.rows }); } catch (e) { console.log("error in selecting todos from db", e); res.status(400).json({ error: e }); } }); App.listen(8000, () => { console.log("App listening on port 8000"); }); Containerizing Our Node.js Application To run our Node.js application in Kubernetes, we first need to build a container image. Create a Dockerfile in the same directory. Dockerfile FROM node:latest WORKDIR /app COPY . . RUN npm install EXPOSE 8000 ENTRYPOINT [ "npm", "start" ] All of our server dependencies will be built into the container image. To run our application using the npm start command, update your package.json file with the start script. JSON … "scripts": { "start": "node index.js" } … Now, we're ready to build our image with Docker. Shell > docker build -t todo-list-app . Sending build context to Docker daemon 458.4MB Step 1/6 : FROM node:latest ---> 344462c86129 Step 2/6 : WORKDIR /app ---> Using cache ---> 49f210e25bbb Step 3/6 : COPY . . ---> Using cache ---> 1af02b568d4f Step 4/6 : RUN npm install ---> Using cache ---> d14416ffcdd4 Step 5/6 : EXPOSE 8000 ---> Using cache ---> e0524327827e Step 6/6 : ENTRYPOINT [ "npm", "start" ] ---> Using cache ---> 09e7c61855b2 Successfully built 09e7c61855b2 Successfully tagged todo-list-app:latest Our application is now packaged and ready to run in Kubernetes. Running Kubernetes Locally With Minikube To run a Kubernetes environment locally, we'll run Minikube, which creates a Kubernetes cluster inside a Docker container running on our machine. Shell > minikube start That was easy! Now we can use the kubectl command-line tool to deploy our application from a Kubernetes configuration file. Deploying to Kubernetes First, we create a configuration file called kubeConfig.yaml which will define the components of our cluster. Kubernetes deployments are used to keep pods running and up-to-date. Here we're creating a set of replicated pods running the todo-list-app container image that we've already built with Docker. YAML apiVersion: apps/v1 kind: Deployment metadata: name: todo-app-deployment labels: app: todo-app spec: selector: matchLabels: app: todo-app replicas: 3 template: metadata: labels: app: todo-app spec: containers: - name: todo-server image: todo-list-app ports: - containerPort: 8000 imagePullPolicy: Never Because imagePullPolicy is set to Never, make sure the image is available inside Minikube's container runtime, for example by running minikube image load todo-list-app or by building the image against Minikube's Docker daemon. In the same file, we'll create a Kubernetes service, which is used to set the networking rules for your application and expose it to clients. YAML --- apiVersion: v1 kind: Service metadata: name: todo-app-service spec: type: NodePort selector: app: todo-app ports: - name: todo-app-service-port protocol: TCP port: 8000 targetPort: 8000 nodePort: 30100 Let's use our configuration file to create our todo-app-deployment and todo-app-service. This will create a networked cluster, resilient to failures and orchestrated by Kubernetes!
Shell > kubectl create -f kubeConfig.yaml Accessing Our Application in Minikube Shell > minikube service todo-app-service --url Starting tunnel for service todo-app-service. Because you are using a Docker driver on darwin, the terminal needs to be open to run it. We can find the tunnel port by executing the following command. Shell > ps -ef | grep docker@127.0.0.1 503 2363 2349 0 9:34PM ttys003 0:00.01 ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -N docker@127.0.0.1 -p 53664 -i /Users/bhoyer/.minikube/machines/minikube/id_rsa -L 63650:10.107.158.206:8000 The output indicates that our tunnel is running at port 63650. We can access our /todos endpoint via this URL in the browser or via a client. Shell > curl -X GET http://127.0.0.1:63650/todos -H 'Content-Type: application/json' {"status":"OK","data":[{"id":1,"description":"Learn how to connect services with Kuberenetes","status":"incomplete"},{"id":2,"description":"Build container images with Docker","status":"incomplete"},{"id":3,"description":"Provision multi-region distributed SQL database","status":"incomplete"}]} Wrapping Up With a distributed infrastructure in place in our application and database tiers, we’ve developed a system built to scale and survive. I know, I know, I promised you the most resilient to-do app the world has ever seen and didn’t provide a user interface. Well, that’s your job! Extend the API service we’ve developed in Node.js to serve the HTML required to display our list. Look out for more from me on Node.js and distributed SQL — until then, keep on coding!
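A small parting utility: if you would rather script that smoke test than paste curl commands, the same endpoint can be called from a few lines of Kotlin using the JDK's built-in HTTP client. The port below is whatever your Minikube tunnel printed (63650 in the example above):
Kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://127.0.0.1:63650/todos")) // replace 63650 with your tunnel port
        .header("Content-Type", "application/json")
        .GET()
        .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println("HTTP ${response.statusCode()}: ${response.body()}")
}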
Managing concurrent access to shared data can be a challenge, but by using the right locking strategy, you can ensure that your applications run smoothly and avoid conflicts that could lead to data corruption or inconsistent results. In this article, we'll explore how to implement pessimistic and optimistic locking using Kotlin, Ktor, and jOOQ, and provide practical examples to help you understand when to use each approach. Whether you are a beginner or an experienced developer, the idea is to walk away with insights into the principles of concurrency control and how to apply them in practice. Data Model Let's say we have a table called users in our MySQL database with the following schema: SQL CREATE TABLE users ( id INT NOT NULL AUTO_INCREMENT, name VARCHAR(255) NOT NULL, age INT NOT NULL, PRIMARY KEY (id) ); Pessimistic Locking We want to implement pessimistic locking when updating a user's age, which means we want to lock the row for that user when we read it from the database and hold the lock until we finish the update. This ensures that no other transaction can update the same row while we're working on it. First, we need to ask jOOQ to use pessimistic locking when querying the users table. We can do this by setting the forUpdate() flag on the SELECT query: Kotlin val user = dslContext.selectFrom(USERS) .where(USERS.ID.eq(id)) .forUpdate() .fetchOne() This will lock the row for the user with the specified ID when we execute the query. Next, we can update the user's age: Kotlin dslContext.update(USERS) .set(USERS.AGE, newAge) .where(USERS.ID.eq(id)) .execute() Note that we need to perform the update within the same transaction that we used to read the user's row and lock it. This ensures that the lock is released when the transaction is committed. You can see how this is done in the next section. Ktor Endpoint Finally, here's an example Ktor endpoint that demonstrates how to use this code to update a user's age: Kotlin post("/users/{id}/age") { val id = call.parameters["id"]?.toInt() ?: throw BadRequestException("Invalid ID") val newAge = call.receive<Int>() dslContext.transaction { config -> val ctx = config.dsl() val user = ctx.selectFrom(USERS) .where(USERS.ID.eq(id)) .forUpdate() .fetchOne() if (user == null) { throw NotFoundException("User not found") } ctx.update(USERS) .set(USERS.AGE, newAge) .where(USERS.ID.eq(id)) .execute() } call.respond(HttpStatusCode.OK) } As you can see, we first read the user's row and lock it using jOOQ's forUpdate() method inside a transaction block. Then we check if the user exists and update their age; jOOQ commits the transaction automatically when the block completes (and rolls it back if an exception is thrown), which also releases the lock. Finally, we respond with an HTTP 200 OK status code to indicate success. Optimistic Version Optimistic locking is a technique where we don't lock the row when we read it, but instead, add a version number to the row and check it when we update it. If the version number has changed since we read the row, it means that someone else has updated it in the meantime, and we need to retry the operation with the updated row. To implement optimistic locking, we need to add a version column to our users table: SQL CREATE TABLE users ( id INT NOT NULL AUTO_INCREMENT, name VARCHAR(255) NOT NULL, age INT NOT NULL, version INT NOT NULL DEFAULT 0, PRIMARY KEY (id) ); We'll use the version column to track the version of each row. Now, let's update our Ktor endpoint to use optimistic locking.
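Before that, one practical note on the pessimistic endpoint above: SELECT ... FOR UPDATE makes competing transactions wait for the lock, so if a writer stalls while holding it, other requests queue up until the database's lock wait timeout expires (innodb_lock_wait_timeout in MySQL, 50 seconds by default). It is often worth surfacing that situation to the caller instead of letting the request hang. The following is only a hedged sketch, not from the article; USERS is the jOOQ-generated table as before, and in real code you would inspect the vendor error code rather than treat every DataAccessException as a lock timeout:
Kotlin
import org.jooq.DSLContext
import org.jooq.exception.DataAccessException

class RowLockedException(message: String, cause: Throwable) : RuntimeException(message, cause)

// Run the locked read-then-update and translate a failure such as a lock wait timeout
// into a domain exception that the route handler can map to an HTTP 503 "please retry".
fun updateAgeWithLock(dsl: DSLContext, id: Int, newAge: Int) {
    try {
        dsl.transaction { config ->
            val ctx = config.dsl()
            ctx.selectFrom(USERS)
                .where(USERS.ID.eq(id))
                .forUpdate()
                .fetchOne() ?: throw NoSuchElementException("User $id not found")
            ctx.update(USERS)
                .set(USERS.AGE, newAge)
                .where(USERS.ID.eq(id))
                .execute()
        }
    } catch (e: DataAccessException) {
        throw RowLockedException("User $id is locked by another transaction, please retry", e)
    }
}
With that noted, back to the optimistic endpoint.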
First, we'll read the user's row and check its version: Kotlin post("/users/{id}/age") { val id = call.parameters["id"]?.toInt() ?: throw BadRequestException("Invalid ID") val newAge = call.receive<Int>() var updated = false while (!updated) { val user = dslContext.selectFrom(USERS) .where(USERS.ID.eq(id)) .fetchOne() if (user == null) { throw NotFoundException("User not found") } val oldVersion = user.version user.age = newAge user.version += 1 val rowsUpdated = dslContext.update(USERS) .set(USERS.AGE, newAge) .set(USERS.VERSION, user.version) .where(USERS.ID.eq(id)) .and(USERS.VERSION.eq(oldVersion)) .execute() if (rowsUpdated == 1) { updated = true } } call.respond(HttpStatusCode.OK) } In this example, we use a while loop to retry the update until we successfully update the row with the correct version number. First, we read the user's row and get its current version number. Then we update the user's age and increment the version number. Finally, we execute the update query and check how many rows were updated. If the update succeeded (i.e., one row was updated), we set updated to true and exit the loop. If the update failed (i.e., no rows were updated because the version number had changed), we repeat the loop and try again. Note that we use the and(USERS.VERSION.eq(oldVersion)) condition in the WHERE clause to ensure that we only update the row if its version number is still the same as the one we read earlier. Trade-Offs Optimistic and pessimistic locking are two essential techniques used in concurrency control to ensure data consistency and correctness in multi-user environments. Pessimistic locking prevents other users from accessing a record while it is being modified, while optimistic locking allows multiple users to access and modify data concurrently. A bank application that handles money transfers between accounts is a good example of a scenario where pessimistic locking is a better choice. In this scenario, when a user initiates a transfer, the system should ensure that the funds in the account are available and that no other user is modifying the same account's balance concurrently. In this case, it is critical to prevent any other user from accessing the account while the transaction is in progress. The application can use pessimistic locking to ensure exclusive access to the account during the transfer process, preventing any concurrent updates and ensuring data consistency. An online shopping application that manages product inventory is an example of a scenario where optimistic locking is a better choice. In this scenario, multiple users can access the same product page and make purchases concurrently. When a user adds a product to the cart and proceeds to checkout, the system should ensure that the product's availability is up to date and that no other user has purchased the same product. It is not necessary to lock the product record as the system can handle conflicts during the checkout process. The application can use optimistic locking, allowing concurrent access to the product record and resolving conflicts during the transaction by checking the product's availability and updating the inventory accordingly. Conclusion When designing and implementing database systems, it's important to be aware of the benefits and limitations of both pessimistic and optimistic locking strategies. While pessimistic locking is a reliable way to ensure data consistency, it can lead to decreased performance and scalability. 
On the other hand, optimistic locking provides better performance and scalability, but it requires careful consideration of concurrency issues and error handling. Ultimately, choosing the right locking strategy depends on the specific use case and trade-offs between data consistency and performance. Awareness of both locking strategies is essential for good decision-making and for building robust and reliable backend systems.
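One practical refinement to the optimistic endpoint shown earlier: the while loop as written retries forever, so under sustained contention a single request could spin indefinitely. A common variation, sketched below under the same assumptions as the original example, caps the number of attempts and reports a conflict to the client:
Kotlin
post("/users/{id}/age") {
    val id = call.parameters["id"]?.toInt() ?: throw BadRequestException("Invalid ID")
    val newAge = call.receive<Int>()
    val maxAttempts = 3
    var updated = false

    repeat(maxAttempts) {
        if (updated) return@repeat
        val user = dslContext.selectFrom(USERS)
            .where(USERS.ID.eq(id))
            .fetchOne() ?: throw NotFoundException("User not found")

        // Compare-and-set on the version column: the update only succeeds if nobody
        // else has bumped the version since we read the row.
        val rowsUpdated = dslContext.update(USERS)
            .set(USERS.AGE, newAge)
            .set(USERS.VERSION, user.version + 1)
            .where(USERS.ID.eq(id))
            .and(USERS.VERSION.eq(user.version))
            .execute()

        if (rowsUpdated == 1) updated = true
    }

    if (updated) {
        call.respond(HttpStatusCode.OK)
    } else {
        call.respond(HttpStatusCode.Conflict, "Concurrent update detected, please retry")
    }
}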
With the growth of application modernization demands, monolithic applications have been refactored into cloud-native microservices and serverless functions with lighter, faster, and smaller application portfolios over the past few years. This was not only about rewriting applications, but the backend data stores were also redesigned in terms of dynamic scalability, high performance, and flexibility for event-driven architecture. For example, traditional data structures in relational databases started to give way to a new approach that enables the storage and retrieval of key-value and document data structures using NoSQL databases. However, faster modernization presents more challenges for Java developers in terms of steep learning curves for adopting new technologies while retaining their current skill sets and experience. For instance, Java developers may need to rewrite existing Java applications in Golang or JavaScript for new serverless functions and learn new APIs or SDKs to process dynamic data records in modernized serverless applications. This article will take you through a step-by-step tutorial to learn how Quarkus enables Java developers to implement serverless functions on AWS Lambda to process dynamic data on AWS DynamoDB. Quarkus not only enables developers to optimize Java applications for superfast startup times (e.g., milliseconds) and tiny memory footprints (e.g., less than 100 MB) for serverless applications, but it also lets them use more than XX AWS extensions to deploy Java applications to AWS Lambda and access AWS DynamoDB directly without steep learning curves. Creating a New Serverless Java Project Using Quarkus We'll use the Quarkus command to generate a new project with required files such as the Maven Wrapper, Dockerfiles, configuration properties, and sample code. Find more information about the benefits of the Quarkus command (CLI) here. Run the following Quarkus command in your working directory. Shell quarkus create piggybank --java=17 You need to use JDK 17 since it is currently the latest version that AWS Lambda supports as a managed Java runtime (Corretto). Let's start Quarkus Live Coding, also known as quarkus dev mode, using the following command. Shell cd piggybank && quarkus dev Developing Business Logic for Piggybank Now let's add a couple of Quarkus extensions to create a DynamoDB entity and relevant abstract services using the following Quarkus command in the piggybank directory. Shell quarkus ext add amazon-dynamodb resteasy-reactive-jackson The output should look like this. Shell [SUCCESS] ✅ Platform io.quarkus.platform:quarkus-amazon-services-bom has been installed [SUCCESS] ✅ Extension io.quarkiverse.amazonservices:quarkus-amazon-dynamodb has been installed [SUCCESS] ✅ Extension io.quarkus:quarkus-resteasy-reactive-jackson has been installed Creating an Entity Class You will create a new data model (Entry.java) file to define Java attributes that map to the fields in DynamoDB. The Java class should look like the following code snippet (you can find the solution in the GitHub repository): Java @RegisterForReflection public class Entry { public Long timestamp; public String accountID; ... public Entry() {} public static Entry from(Map<String, AttributeValue> item) { Entry entry = new Entry(); if (item != null && !item.isEmpty()) { entry.setAccountID(item.get(AbstractService.ENTRY_ACCOUNTID_COL).s()); ... } return entry; } ...
} The @RegisterForReflection annotation instructs Quarkus to keep the class and its members during native compilation. Find more information here. Creating an Abstract Service Now you will create a new AbstractService.java file consisting of helper methods that prepare the DynamoDB request objects for reading and adding items to the table. The code snippet should look like this (find the solution in the GitHub repository): Java public class AbstractService { public String accountID; ... public static final String ENTRY_ACCOUNTID_COL = "accountID"; ... public String getTableName() { return "finance"; } protected ScanRequest scanRequest() { return ScanRequest.builder().tableName(getTableName()) .attributesToGet(ENTRY_ACCOUNTID_COL, ENTRY_DESCRIPTION_COL, ENTRY_AMOUNT_COL, ENTRY_BALANCE_COL, ENTRY_DATE_COL, ENTRY_TIMESTAMP, ENTRY_CATEGORY).build(); } ... } Adding a Business Layer for REST APIs Create a new EntryService.java file that extends the AbstractService class and will be the business layer of your application. This logic will store and retrieve the entry data from DynamoDB synchronously. The code snippet should look like this (solution in the GitHub repository): Java @ApplicationScoped public class EntryService extends AbstractService { @Inject DynamoDbClient dynamoDB; public List<Entry> findAll() { List<Entry> entries = dynamoDB.scanPaginator(scanRequest()).items().stream() .map(Entry::from) .collect(Collectors.toList()); entries.sort((e1, e2) -> e1.getDate().compareTo(e2.getDate())); BigDecimal balance = new BigDecimal(0); for (Entry entry : entries) { balance = balance.add(entry.getAmount()); entry.setBalance(balance); } return entries; } ... } Creating REST APIs Now you'll create a new EntryResource.java file to implement REST APIs to get and post the entry data from and to DynamoDB. The code snippet should look like the one below (solution in the GitHub repository): Java @Path("/entryResource") public class EntryResource { SimpleDateFormat piggyDateFormatter = new SimpleDateFormat("yyyy-MM-dd+HH:mm"); @Inject EntryService eService; @GET @Path("/findAll") public List<Entry> findAll() { return eService.findAll(); } ... } Verify the Business Services Locally First, we need to install a local DynamoDB that the piggy bank services access. There are a variety of ways to stand up a local DynamoDB, such as downloading an executable .jar file, running a container image, or adding it as an Apache Maven dependency. Today, you will use Docker Compose to install and run DynamoDB locally. Find more information here. Create the following docker-compose.yml file in your local environment. YAML version: '3.8' services: dynamodb-local: command: "-jar DynamoDBLocal.jar -sharedDb -dbPath ./data" image: "amazon/dynamodb-local:latest" container_name: dynamodb-local ports: - "8000:8000" volumes: - "./docker/dynamodb:/home/dynamodblocal/data" working_dir: /home/dynamodblocal Then, run the following command. Shell docker-compose up The output should look like this.
Shell [+] Running 2/2 ⠿ Network quarkus-piggybank_default Created 0.0s ⠿ Container dynamodb-local Created 0.1s Attaching to dynamodb-local dynamodb-local | Initializing DynamoDB Local with the following configuration: dynamodb-local | Port: 8000 dynamodb-local | InMemory: false dynamodb-local | DbPath: ./data dynamodb-local | SharedDb: true dynamodb-local | shouldDelayTransientStatuses: false dynamodb-local | CorsParams: null dynamodb-local | Creating an Entry Table Locally Run the following AWS DynamoDB API command to create a new table (named finance) in the running DynamoDB container. Shell aws dynamodb create-table --endpoint-url http://localhost:8000 --table-name finance --attribute-definitions AttributeName=accountID,AttributeType=S AttributeName=timestamp,AttributeType=N --key-schema AttributeName=timestamp,KeyType=HASH AttributeName=accountID,KeyType=RANGE --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5 --table-class STANDARD Adding DynamoDB Client Configuration DynamoDB clients are configured in the application.properties file. You also need to add a proper implementation of the sync client to the classpath. By default, the extension uses the java.net.URLConnection HTTP client. Open the pom.xml file and copy the following dependency right after the quarkus-amazon-dynamodb dependency. XML <dependency> <groupId>software.amazon.awssdk</groupId> <artifactId>url-connection-client</artifactId> </dependency> Then, add the following key and value to the application.properties to specify your local DynamoDB's endpoint. Plain Text %dev.quarkus.dynamodb.endpoint-override=http://localhost:8000 Starting Quarkus Live Coding Now you should be ready to verify the Piggybank application using Quarkus dev mode and the local DynamoDB. Run Quarkus dev mode using the following command. Shell quarkus dev The output should end like this. Shell __ ____ __ _____ ___ __ ____ ______ --/ __ \/ / / / _ | / _ \/ //_/ / / / __/ -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \ --\___\_\____/_/ |_/_/|_/_/|_|\____/___/ [io.quarkus] (Quarkus Main Thread) Profile dev activated. Live Coding activated.quarkus xx.xx.xx.) s2023-04-30 21:14:49,824 INFO [io.quarkus] (Quarkus Main Thread) Installed features: [amazon-dynamodb, cdi, resteasy-reactive, resteasy-reactive-jackson, smallrye-context-propagation, vertx] -- Tests paused Press [r] to resume testing, [o] Toggle test output, [:] for the terminal, [h] for more options> Run the following curl commands to insert several expense items into the piggy bank (the finance table). Shell curl -X POST http://localhost:8080/entryResource -H 'Content-Type: application/json' -d '{"accountID": "Food", "description": "Shrimp", "amount": "-20", "balance": "0", "date": "2023-02-01"}' curl -X POST http://localhost:8080/entryResource -H 'Content-Type: application/json' -d '{"accountID": "Car", "description": "Flat tires", "amount": "-200", "balance": "0", "date": "2023-03-01"}' curl -X POST http://localhost:8080/entryResource -H 'Content-Type: application/json' -d '{"accountID": "Payslip", "description": "Income", "amount": "2000", "balance": "0", "date": "2023-04-01"}' curl -X POST http://localhost:8080/entryResource -H 'Content-Type: application/json' -d '{"accountID": "Utilities", "description": "Gas", "amount": "-400", "balance": "0", "date": "2023-05-01"}' Verify the stored data using the following command. Shell curl http://localhost:8080/entryResource/findAll The output should look like this.
JSON [{"accountID":"Food","description":"Shrimp","amount":"-20","balance":"-30","date":"2023-02-01"},{"accountID":"Drink","description":"Wine","amount":"-10","balance":"-10","date":"2023-01-01"},{"accountID":"Payslip","description":"Income","amount":"2000","balance":"1770","date":"2023-04-01"},{"accountID":"Car","description":"Flat tires","amount":"-200","balance":"-230","date":"2023-03-01"},{"accountID":"Utilities","description":"Gas","amount":"-400","balance":"1370","date":"2023-05-01"}] You can also find a certain expense based on accountID. Run the following curl command again. Shell curl http://localhost:8080/entryResource/find/Drink The output should look like this. JSON {"accountID":"Drink","description":"Wine","amount":"-10","balance":"-10","date":"2023-01-01"} Conclusion You learned how Quarkus enables developers to write serverless functions that connect NoSQL databases to process dynamic data. To stand up local development environments, you quickly ran the local DynamoDB image using the docker-compose command as well. Quarkus also provide various AWS extensions including amazon-dynamodb to access the AWS cloud services directly from your Java applications. Find more information here. In the next article, you’ll learn how to create a serverless database using AWS DynamoDB and build and deploy your local serverless Java functions to AWS Lambda by enabling SnapStart.
Oren Eini
Wizard,
Hibernating Rhinos @ayende
Abhishek Gupta
Principal Developer Advocate,
AWS
Artem Ervits
Solutions Engineer,
Cockroach Labs
Sahiti Kappagantula
Product Associate,
Oorwin