GCP Bigtable

Overview of Bigtable:

  • Cloud Bigtable lets you store terabytes or even petabytes of data.
  • A single value in each row is indexed; this value is known as the row key (see the sketch after this list).
  • It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.
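To make the row-key-as-only-index idea concrete, here is a minimal sketch using the google-cloud-bigtable Python client. The project, instance, and table IDs, the column family name, and the row key are all placeholder assumptions, and the table is assumed to already exist with a cf1 column family.

```python
from google.cloud import bigtable

# Hypothetical project/instance/table IDs -- replace with your own.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# Write: every mutation is addressed by the row key, the table's only index.
row_key = b"user#42"
row = table.direct_row(row_key)
row.set_cell("cf1", b"name", b"Ada")  # column family, qualifier, value
row.commit()

# Read: point lookups go through the same row key.
result = table.read_row(row_key)
if result is not None:
    cell = result.cells["cf1"][b"name"][0]
    print(cell.value)  # b"Ada"
```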

It’s Good For:

  • Time-series data, such as CPU and memory usage over time for multiple servers.
  • Marketing data, such as purchase histories and customer preferences.
  • Financial data, such as transaction histories, stock prices, and currency exchange rates.
  • Internet of Things data, such as usage reports from energy meters and home appliances.
  • Graph data, such as information about how users are connected to one another.

Load balancing

  • Each Cloud Bigtable zone is managed by a master process, which balances workload and data volume within clusters. 
  • Cloud Bigtable manages all of the splitting, merging, and rebalancing automatically, saving users the effort of manually administering their tablets.

Cloud Bigtable and other storage options

Cloud Bigtable is not a relational database; it does not support SQL queries or joins, nor does it support multi-row transactions. Also, it is not a good solution for storing less than 1 TB of data.

  • If you need full SQL support for an online transaction processing (OLTP) system, consider Cloud Spanner or Cloud SQL.
  • If you need interactive querying in an online analytical processing (OLAP) system, consider BigQuery.
  • If you need to store immutable blobs larger than 10 MB, such as large images or movies, consider Cloud Storage.
  • If you need to store highly structured objects in a document database, with support for ACID transactions and SQL-like queries, consider Cloud Datastore.

Schema Design

  1. General concepts

As you design your Cloud Bigtable schema, keep the following concepts in mind:

  • Each table has only one index, the row key. There are no secondary indices.
  • Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian, or network, byte order, the binary equivalent of alphabetical order.
  • All operations are atomic at the row level. For example, if you update two rows in a table, it’s possible that one row will be updated successfully and the other update will fail. Avoid schema designs that require atomicity across rows (see the sketch after this list).
  • Ideally, both reads and writes should be distributed evenly across the row space of the table.
  • In general, keep all information for an entity in a single row. An entity that doesn’t need atomic updates and reads can be split across multiple rows. Splitting across multiple rows is recommended if the entity data is large (hundreds of MB).
  • Related entities should be stored in adjacent rows, which makes reads more efficient.
  • Cloud Bigtable tables are sparse. Empty columns don’t take up any space. As a result, it often makes sense to create a very large number of columns, even if most columns are empty in most rows.
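As a hedged illustration of row-level atomicity, the sketch below packs several related values for one entity into a single row so they are written in one atomic commit. The IDs, row key, and column family are hypothetical and assume an existing table with a cf1 family.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# All cells for one entity live in one row, so this commit is atomic:
# either every cell is written or none of them are.
row = table.direct_row(b"order#2024-0001")
row.set_cell("cf1", b"status", b"shipped")
row.set_cell("cf1", b"carrier", b"acme-post")
row.set_cell("cf1", b"updated_at", b"2024-05-01T12:00:00Z")
row.commit()

# The same guarantee does NOT span rows: two separate direct_row().commit()
# calls are independent mutations, and one can fail while the other succeeds.
```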

  2. Size limits

As a best practice, store a maximum of 10 MB in a single cell and 100 MB in a single row. You must also stay below the hard limits on data size within cells and rows.

  3. Choosing a row key

Start by asking how you’ll use the data that you plan to store. For example:

  • User information: Do you need quick access to information about connections between users (for example, whether user A follows user B)?
  • User-generated content: If you show users a sample of a large amount of user-generated content, such as status updates, how will you decide which status updates to display to a given user?
  • Time series data: Will you often need to retrieve the most recent N records, or records that fall within a certain time range? If you’re storing data for several kinds of events, will you need to filter based on the type of event?

By understanding your needs up front, you can ensure that your row key, and your overall schema design, provide enough flexibility to query your data efficiently.

Types of row keys

As a general rule of thumb, keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Cloud Bigtable server.

Use Timestamps: If you often need to retrieve data based on the time when it was recorded, it’s a good idea to include a timestamp as part of your row key. 

For example, your application might need to record performance-related data, such as CPU and memory usage, once per second for a large number of machines. Your row key for this data could combine an identifier for the machine with a timestamp for the data (for example, machine_4223421#1425330757685).
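A minimal sketch of that pattern, assuming the google-cloud-bigtable Python client and a pre-existing table; the machine ID, column family, metric names, and time range are placeholder assumptions.

```python
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("metrics")

machine_id = "machine_4223421"

# Write one sample, keyed by machine ID plus a millisecond timestamp.
now_ms = int(time.time() * 1000)
row = table.direct_row(f"{machine_id}#{now_ms}".encode())
row.set_cell("stats", b"cpu", b"0.73")
row.set_cell("stats", b"mem", b"0.41")
row.commit()

# Because rows sort lexicographically, all samples for one machine are
# contiguous, and a time range becomes a simple start/end key scan.
start = f"{machine_id}#{now_ms - 60_000}".encode()  # last 60 seconds
end = f"{machine_id}#{now_ms + 1}".encode()
for sample in table.read_rows(start_key=start, end_key=end):
    print(sample.row_key, sample.cells["stats"][b"cpu"][0].value)
```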

Multiple values in a single row key: When your row key includes multiple values, it’s especially important to have a clear understanding of how you’ll use your data.

Row keys to avoid

  • Avoid using a single row key to identify a value that must be updated very frequently. For example, if you store memory-usage data once per second, do not use a single row key named memusage and update the row repeatedly.
  • Avoid using standard, non-reversed domain names as row keys. Using standard domain names makes it inefficient to retrieve all of the rows within a portion of the domain (for example, all rows that relate to company.com will be in separate row ranges like services.company.com, product.company.com and so on). 
  • Avoid hashed values; use human-readable values instead. If your row key includes multiple values, separate those values with a delimiter (see the sketch after this list).
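To illustrate the last two points, the small sketch below builds a reversed domain name key and a delimiter-separated multi-value key; it is plain string handling with hypothetical values, not a prescribed format.

```python
# Reversed domain name: related rows for one organization stay contiguous
# when sorted lexicographically (com.company, com.company.product, ...).
def reverse_domain(domain: str) -> str:
    return ".".join(reversed(domain.split(".")))

print(reverse_domain("services.company.com"))  # com.company.services
print(reverse_domain("product.company.com"))   # com.company.product

# Multiple human-readable values joined with an explicit delimiter,
# ordered from most general to most specific.
def build_row_key(*parts: str, delimiter: str = "#") -> bytes:
    return delimiter.join(parts).encode()

print(build_row_key("com.company.services", "us-east1", "2024-05-01"))
```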

Column families and column qualifiers

— Column families: In Cloud Bigtable, unlike in HBase, you can use up to about 100 column families while maintaining excellent performance. As a result, whenever a row contains multiple values that are related to one another, it’s a good practice to group those values into the same column family. Grouping data into column families enables you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. 

— Column qualifiers: Because Cloud Bigtable tables are sparse, you can create as many column qualifiers as you need in each row. However, it’s a good idea to keep the names of column qualifiers short, which helps reduce the amount of data transferred for each request.
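A hedged sketch of both ideas, using the google-cloud-bigtable Python client to create a table with two column families and then write short-named qualifiers into each; the family names, GC rule, and IDs are illustrative assumptions.

```python
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Two column families group related values; keep only one version per cell.
one_version = column_family.MaxVersionsGCRule(1)
table = instance.table("profiles")
table.create(column_families={"id": one_version, "prefs": one_version})

# Short qualifier names reduce the bytes sent with every request.
row = table.direct_row(b"user#42")
row.set_cell("id", b"n", b"Ada Lovelace")     # "n" for name
row.set_cell("id", b"e", b"ada@example.com")  # "e" for email
row.set_cell("prefs", b"th", b"dark")         # "th" for theme
row.commit()

# Reads can then target a single family instead of the whole row,
# e.g. with a family-name filter on read_row/read_rows.
```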

Access Control:

For Cloud Bigtable, you can configure access control at the project level and the instance level. Here are some examples of using access control at the project level:

  • Allow a user to read from, but not write to, any table within the project.
  • Allow a user to read from and write to any table within the project, but not manage instances.
  • Allow a user to read from and write to any table within the project, and manage instances.

Here are some examples of using access control at the instance level (a sketch of the instance-level case follows this list):

  • Allow a user to read from any table in a development instance, with no access to tables in a production instance.
  • Allow a user to read from and write to any table in a development instance, and to read from any table in a production instance.
  • Allow a user to manage a development instance, but not a production instance.
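As a hedged illustration of the instance-level case, the sketch below grants a user read-only access on a development instance using the Python client's IAM helpers; the instance ID, user email, and role string are assumptions, and the caller must be allowed to set IAM policies.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)
dev_instance = client.instance("dev-instance")

# Fetch the current policy, add a read-only binding, and write it back.
policy = dev_instance.get_iam_policy()
policy["roles/bigtable.reader"] = ["user:analyst@example.com"]
dev_instance.set_iam_policy(policy)

# The same user gets no binding on the production instance, so they can
# read tables in dev-instance but have no access to prod-instance tables.
```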

Key Visualizer: Key Visualizer can help you complete the following tasks:

  • Check whether your reads or writes are creating hotspots on specific rows
  • Find rows that contain too much data
  • Look at whether your access patterns are balanced across all of the rows in a table

Key Visualizer scans: Key Visualizer automatically generates hourly and daily scans for every table in your instance that meets at least one of the following criteria:

  • During the previous 24 hours, the table contained at least 30 GB of data at some point in time.
  • During the previous 24 hours, the average of all reads or all writes was at least 10,000 rows per second.

Causes of slower performance

  • The table’s schema is not designed correctly. 
  • The workload isn’t appropriate for Cloud Bigtable.
  • The rows in your Cloud Bigtable table contain large amounts of data. 
  • The rows in your Cloud Bigtable table contain a very large number of cells.
  • The Cloud Bigtable cluster doesn’t have enough nodes. 
  • The Cloud Bigtable cluster was scaled up or scaled down recently. 
  • The Cloud Bigtable cluster uses HDD disks. 
  • The Cloud Bigtable instance is a development instance. 
  • There are issues with the network connection. 

Source Url: https://cloud.google.com/bigtable/docs/overview
