Spend quality time with the data. Read through random samples, then another set of random samples, and reread if necessary. This will give you good ideas about the pre-processing steps you should take, the complexity of the text, and, essentially, what kind of language you are dealing with. Note the document lengths of the largest and smallest samples, and develop intuitions as to why they are long or short. Measure simple things such as average word length, average sentence length, and the like. If your dataset has other fields, such as the text source or the date it became available, EDA on these will help too.
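These simple measurements need nothing beyond the standard library. A minimal sketch (the function name and the naive tokenization are my own, illustrative choices):

```python
import re
import statistics

def quick_text_stats(docs):
    """Tiny EDA sketch: corpus-level averages plus document-length extremes."""
    word_lengths, sentence_lengths, doc_lengths = [], [], []
    for doc in docs:
        words = re.findall(r"[A-Za-z']+", doc)
        # Naive sentence split on ., !, ? -- good enough for a first look.
        sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
        word_lengths.extend(len(w) for w in words)
        sentence_lengths.extend(len(re.findall(r"[A-Za-z']+", s)) for s in sentences)
        doc_lengths.append(len(words))
    return {
        "avg_word_len": statistics.mean(word_lengths),
        "avg_sentence_len": statistics.mean(sentence_lengths),
        "longest_doc": max(doc_lengths),
        "shortest_doc": min(doc_lengths),
    }
```

Eyeball the extremes this returns, then go read the documents behind them.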
Always build a baseline machine learning model for comparison. I recommend using bag-of-words (normalize frequencies by document length if necessary) with XGBoost, and SHAP plots later for interpretation. Bag-of-words is known for being extremely austere, but because it treats a document as a multiset and leverages word multiplicity, you will be surprised to see how useful it can turn out to be. Keeping track of counts and verifying that frequencies are not driven by external factors is a helpful sanity check.
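The feature side of this baseline fits in a few stdlib lines; a sketch with optional length normalization (the function name and vocabulary handling are illustrative, not from the text):

```python
from collections import Counter

def bow_features(doc, vocabulary, normalize=True):
    """Bag-of-words: a document as a multiset of tokens.
    Returns one count (or length-normalized frequency) per vocabulary word."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1
    if normalize:
        # Normalize by document length so long documents don't dominate.
        return [counts[w] / total for w in vocabulary]
    return [counts[w] for w in vocabulary]

# These vectors can be fed straight into a gradient-boosted model such as
# xgboost.XGBClassifier, with SHAP summary plots on top for inspection.
```

In practice you would let something like scikit-learn's CountVectorizer build the vocabulary for you; the point here is how little machinery the baseline actually needs.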
Check if the language inherently has certain linguistic properties that make it stand out. You can code things manually, such as simple n-gram analyses, the number of named entities, readability scores (not just the common and often misleading Flesch-Kincaid one, but other relevant formulas), etc. Or you can use neatly engineered tools such as Coh-Metrix. Such properties can be used downstream with or without deep learning predictions.
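The manual n-gram analysis is a one-liner over the standard library's Counter; a minimal sketch (function name is my own):

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Count word n-grams in a token list -- a quick handcrafted
    linguistic probe whose top entries often reveal domain jargon,
    boilerplate, or templates in the corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

The resulting counts (or derived statistics such as distinct-n-gram ratios) can join your downstream feature set directly.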
Do not immediately remove numbers during pre-processing; explore strategies first. To intervene manually, you can preserve the syntax and map them to specific template tokens (7 becomes <NUMBER>; 7% becomes <NUMBER><PERCENTAGE>). I also like keeping them as is when the digit distribution itself may carry signal, as Benford's law suggests.
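The templating variant is two regex substitutions, applied percent-first so the `%` pattern wins; a sketch using the token names from the tip above:

```python
import re

def template_numbers(text):
    """Replace numeric literals with template tokens, preserving syntax:
    7 -> <NUMBER>; 7% -> <NUMBER><PERCENTAGE>."""
    # Handle percentages first, so plain-number matching doesn't eat the digits.
    text = re.sub(r"\d+(?:\.\d+)?%", "<NUMBER><PERCENTAGE>", text)
    text = re.sub(r"\d+(?:\.\d+)?", "<NUMBER>", text)
    return text
```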
Do not immediately remove stop words. Such removal benefits frequency-based approaches, where it helps reduce dimensionality, but do not assume the same holds for deep learning. Stop word lists vary across contexts and languages, and you need the words themselves when you are exploiting text semantics, as in a seq2seq model. For example, "expect" is not the same as "not expect," and "investment of the stock" is not the same as "investment stock."
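The negation hazard is easy to demonstrate; a sketch with a deliberately tiny, illustrative stop word list (real lists, e.g. NLTK's or spaCy's, are larger and language-specific):

```python
# Toy list for illustration only -- note that "not" appears in many
# real English stop word lists too, which is exactly the problem.
STOP_WORDS = {"the", "of", "a", "an", "not", "do", "is"}

def remove_stop_words(text):
    """Drop stop words from a whitespace-tokenized, lowercased text."""
    return " ".join(t for t in text.lower().split() if t not in STOP_WORDS)
```

After removal, "do not expect growth" and "expect growth" become indistinguishable, which is fatal for any model relying on semantics.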
Carefully evaluate your choice of padding technique. With RNN-based models, post-padding adds trailing zeros that act as noise and distort the hidden state carried past the final real word(s).
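The difference between the two schemes is just where the zeros go; a stdlib sketch (function name and defaults are my own; libraries like Keras expose the same choice via a `padding='pre'/'post'` argument):

```python
def pad_sequence(seq, max_len, pre=True, pad_value=0):
    """Pad (or truncate) a token-id sequence to max_len.
    Pre-padding keeps the real tokens adjacent to the RNN's final steps,
    so the last hidden state summarizes words, not padding."""
    seq = seq[:max_len]
    padding = [pad_value] * (max_len - len(seq))
    return padding + seq if pre else seq + padding
```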
There's no need to immediately deep-dive into large pre-trained LLMs or contextual language models. They require careful hyperparameter tuning, and they're not as generalizable as they are marketed to be. A Hierarchical Attention Network is an excellent first choice. So is a simple Transformer architecture. Maybe even a deep or stacked BiLSTM.
One effective fine-tuning strategy can be to freeze and unfreeze weights (with a very low learning rate), as in the example here.
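The mechanics reduce to flipping a trainability flag on selected parameters; a hedged sketch that works on any iterable of `(name, param)` pairs whose params expose a `requires_grad` attribute, e.g. `model.named_parameters()` from a PyTorch `nn.Module` (the helper name and the schedule in the comments are my own):

```python
def set_trainable(named_params, prefixes, trainable):
    """Gradual (un)freezing sketch: flip requires_grad on every parameter
    whose name starts with one of the given prefixes."""
    for name, param in named_params:
        if any(name.startswith(p) for p in prefixes):
            param.requires_grad = trainable

# A typical schedule (assuming a PyTorch model with an "encoder" block):
#   set_trainable(model.named_parameters(), ["encoder"], False)  # freeze body
#   ... train only the classifier head for a few epochs ...
#   set_trainable(model.named_parameters(), ["encoder.layer.11"], True)
#   ... continue with a very low learning rate, unfreezing deeper
#       layers one group at a time ...
```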
If you are exploring static pre-trained non-contextual word embeddings, try fastText. FastText leverages sub-word information and has fast loading times. GloVe and Word2Vec are showing their age: neither can produce vectors for out-of-vocabulary words, which fastText's sub-word n-grams can.
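The sub-word trick is simple to illustrate: fastText represents a word by character n-grams of the word wrapped in boundary markers, so unseen words still share features with seen ones. A sketch (function name and default range are my own; fastText's defaults are n = 3 to 6):

```python
def subword_ngrams(word, n_min=3, n_max=5):
    """fastText-style sub-word features: character n-grams of the word
    wrapped in '<' and '>' boundary markers. The word's vector is the
    sum of its n-gram vectors, so OOV words still get a representation."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams
```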
You can also explore methods to combine models trained on separate (contextual) embeddings later, as suggested here. If you are building multiple deep learning models, look out for contrasting or opposing precision and recall outcomes, i.e., the recall of one model mirrors the precision of the other, and vice versa. This suggests that some kind of multistage feature learning, common in ML, will be helpful.
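One of the simplest ways to combine such complementary models is soft voting over their predicted class probabilities; a stdlib sketch (function name and weighting scheme are my own, not from the linked suggestion):

```python
def soft_vote(prob_lists, weights=None):
    """Combine per-model class-probability vectors by (weighted) averaging.
    Complementary precision/recall profiles often blend well this way;
    a stacked meta-learner is the natural next step if averaging helps."""
    n_models = len(prob_lists)
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(prob_lists[0])
    return [sum(w * probs[c] for w, probs in zip(weights, prob_lists))
            for c in range(n_classes)]
```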
Try part-of-speech tagging if you have a large corpus. If words within a sentence tend to have similar parts of speech or if certain patterns frequently occur (e.g., nouns followed by adjectives, verbs followed by objects), it indicates local correlations. If there is evidence of local correlations, 1D CNNs will do quite well.
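Checking for such patterns amounts to counting adjacent tag pairs; a sketch that consumes already-tagged sentences (e.g. from `nltk.pos_tag`; the function name and toy tag set in the test are my own):

```python
from collections import Counter

def pos_transition_counts(tagged_sentences):
    """Count adjacent POS-tag pairs across sentences. A few dominant
    transitions suggest local correlations that 1D CNNs can exploit.
    Input: lists of (word, tag) pairs."""
    transitions = Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        transitions.update(zip(tags, tags[1:]))
    return transitions
```

If a handful of transitions dominate the counts, short convolutional filters have something consistent to latch onto.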
In the same context of CNNs, hybrid CNN-BiLSTM models are a great choice, but specifically when there is evidence that CNNs alone are already yielding good performance, and in that order (CNN first). The CNN can first learn a lower-dimensional representation that still preserves important features, which in turn makes the BiLSTM layers faster to process and less prone to errors from vanishing or exploding gradients.
The attention mechanism is a very good friend. If you have strong intuition that only specific parts of your text samples are relevant for classification, add it at the beginning of your hybrid models. Otherwise, add it as a final layer of refinement, and if you notice a performance improvement, it suggests that classification decisions draw on the broader context of the entire input text.
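At its core, attention is a softmax over relevance scores followed by a weighted average; a stdlib sketch (in a real model the scores come from a learned layer, e.g. a small feed-forward network over the hidden states; here they are passed in directly, and the function name is my own):

```python
import math

def attention_pool(hidden_states, scores):
    """Minimal attention sketch: softmax the relevance scores, then take
    the weighted average of the hidden-state vectors. Returns the pooled
    vector and the weights, which double as an interpretability aid."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights
```

Plotting the weights over the input tokens is a cheap way to check whether the model really is focusing where your intuition says it should.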
Attempt data augmentation, particularly if your dataset is small. Try the methods from this really good paper: random synonym replacement, random insertion of a synonym of a random non-stop word at a random position, word position swapping, and random word deletion.
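The swap and deletion operations need nothing beyond the standard library (the two synonym-based ones additionally need a thesaurus such as WordNet); a sketch with my own function names, in the spirit of the paper's operations:

```python
import random

def random_swap(tokens, n_swaps, rng=random):
    """Swap the positions of two random words, n_swaps times."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng=random):
    """Delete each word independently with probability p;
    always keep at least one word so the sample isn't emptied."""
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]
```

Generating a handful of augmented variants per original sample is usually enough to move the needle on small datasets.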