Linux: replace the nth instance of a line matching a pattern in a file

Question Detail

I have a file like this:

        <div class='items'>
          <div class='item'>
            <div class='itemDescription'>random string 1</div>
            <div class='itemDate'>random date 1</div>
          </div>
          <div class='item'>
            <div class='itemDescription'>random string 2</div>
            <div class='itemDate'>random date 2</div>
          </div>
          <div class='item'>
            <div class='itemDescription'>random string 3</div>
            <div class='itemDate'>random date 3</div>
          </div>
          <div class='item'>
            <div class='itemDescription'>random string 4</div>
            <div class='itemDate'>random date 4</div>
          </div>
        </div>

I need to be able to replace (here: delete) the nth occurrence of an item, where an item is a group of lines in the file. For example, when n=3:

        <div class='items'>
          <div class='item'>
            <div class='itemDescription'>random string 1</div>
            <div class='itemDate'>random date 1</div>
          </div>
          <div class='item'>
            <div class='itemDescription'>random string 2</div>
            <div class='itemDate'>random date 2</div>
          </div>
          <div class='item'>
            <div class='itemDescription'>random string 4</div>
            <div class='itemDate'>random date 4</div>
          </div>
        </div>

And when n=2:

        <div class='items'>
          <div class='item'>
            <div class='itemDescription'>random string 1</div>
            <div class='itemDate'>random date 1</div>
          </div>
          <div class='item'>
            <div class='itemDescription'>random string 3</div>
            <div class='itemDate'>random date 3</div>
          </div>
          <div class='item'>
            <div class='itemDescription'>random string 4</div>
            <div class='itemDate'>random date 4</div>
          </div>
        </div>

How would I be able to accomplish this with sed?

I was hoping for something like:

sed -i "/\s*<div class='item'>/3d";
sed -i "/\s*<div class='itemDescription'>random string 4</div>/3d";
sed -i "/\s*<div class='itemDate'>random date 4</div>/3d";
sed -i "/\s*</div>/3d";

Above, 3d would mean "delete the 3rd instance of a match", ignoring the other instances.

Using a range where n=3:

sed -i "/\s*<div class='item'>/3,/\s*</div>/{///!d}";

Above, /\s*<div class='item'>/3 would mean "start from the third match of the pattern" instead of the first.

None of the above are valid sed statements, but they give an idea of what I'm looking for.

I'm also open to using awk or another tool, e.g. awk -i inplace "..." file

Also, I don't think deleting a fixed number of lines after the match is a good idea, in case the random string becomes multi-line.

I hope this is clear. Thanks for any help in advance.

Question Answer

Regular expressions and tools like sed are the wrong choice for working with structured data like XML. Instead, you want something that understands XML and can manipulate documents based on XPath expressions. xmlstarlet is one popular such tool. For example, to delete the third item div:

$ xmlstarlet ed -d '//div[@class="item"][3]' example.xml
<?xml version="1.0"?>
<div class="items">
  <div class="item">
    <div class="itemDescription">random string 1</div>
    <div class="itemDate">random date 1</div>
  </div>
  <div class="item">
    <div class="itemDescription">random string 2</div>
    <div class="itemDate">random date 2</div>
  </div>
  <div class="item">
    <div class="itemDescription">random string 4</div>
    <div class="itemDate">random date 4</div>
  </div>
</div>

Or using hxremove from the W3C's HTML-XML-utils package, which uses CSS selectors instead of XPath:

$ hxremove '.item:nth-child(3)' < example.xml 
<div class="items">
  <div class="item">
    <div class="itemDescription">random string 1</div>
    <div class="itemDate">random date 1</div>
  </div>
  <div class="item">
    <div class="itemDescription">random string 2</div>
    <div class="itemDate">random date 2</div>
  </div>
  
  <div class="item">
    <div class="itemDescription">random string 4</div>
    <div class="itemDate">random date 4</div>
  </div>
</div>

Based on the answer from @Shawn:

xmlstarlet ed --pf --omit-decl --inplace -d '//div[@class="newsItem"][3]' file.html

Explanation:

# ed - edit
# --pf - preserve the original formatting
# --omit-decl - omit the XML declaration <?xml version="1.0"?>
# --inplace - save the changes to the file instead of only printing the result

sed -i '/^[[:space:]]*$/d' file.html # delete empty lines
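
To see what that cleanup step does on its own, here is a small sketch that runs the same sed expression on a throwaway three-line input instead of file.html. The function name strip_blank_lines is just made up for illustration, and -i is left out so nothing is modified:

```shell
# Hypothetical helper: same expression as above, but reading stdin (or the
# given files) and writing to stdout instead of editing in place.
strip_blank_lines() {
    sed '/^[[:space:]]*$/d' "$@"
}

# The middle input line contains only spaces, so it is dropped.
printf '<a>\n   \n</a>\n' | strip_blank_lines
```

Lines that are empty or contain only whitespace are removed. Everything else, including indentation, is left as is.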

input file.html:

<div>
  <style>
    /*...*/
  </style>
  <div class="news">
    <div class="newsContent">
      <div class="newsHeading">Latest News</div>
      <div class="newsItems row">
        <div class='newsItem'>
          <div class='newsItemDate'>random date 1</div>
          <div class='newsItemHeading'>random heading</div>
        </div>
        <div class="newsItem">
          <div class="newsItemDate">random date 2</div>
          <div class="newsItemHeading">random heading 2</div>
        </div>
        <div class="newsItem">
          <div class="newsItemDate">random date 3</div>
          <div class="newsItemHeading">random heading 3</div>
        </div>
      </div>
    </div>
  </div>
</div>

output file.html before sed:

<div>
  <style>
    /*...*/
  </style>
  <div class="news">
    <div class="newsContent">
      <div class="newsHeading">Latest News</div>
      <div class="newsItems row">
        <div class='newsItem'>
          <div class='newsItemDate'>random date 1</div>
          <div class='newsItemHeading'>random heading</div>
        </div>
        <div class="newsItem">
          <div class="newsItemDate">random date 2</div>
          <div class="newsItemHeading">random heading 2</div>
        </div>

      </div>
    </div>
  </div>
</div>

output file.html after sed:

<div>
  <style>
    /*...*/
  </style>
  <div class="news">
    <div class="newsContent">
      <div class="newsHeading">Latest News</div>
      <div class="newsItems row">
        <div class='newsItem'>
          <div class='newsItemDate'>random date 1</div>
          <div class='newsItemHeading'>random heading</div>
        </div>
        <div class="newsItem">
          <div class="newsItemDate">random date 2</div>
          <div class="newsItemHeading">random heading 2</div>
        </div>
      </div>
    </div>
  </div>
</div>
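
Since the question also mentions awk: if the file is not well-formed XML and an XML tool refuses it, something along these lines might work as a fallback. This is only a rough sketch and not part of either answer above. The function name delete_nth_item is made up, and it assumes that each item's opening tag starts its own line and that "<div" never shows up inside comments or attribute values. It tracks div nesting depth rather than deleting a fixed number of lines, so a multi-line description should not break it.

```shell
# delete_nth_item N FILE (hypothetical helper)
# Prints FILE without the Nth <div class='item'> block (either quote style).
delete_nth_item() {
    awk -v n="$1" '
        # Number of non-overlapping matches of regex re in string s.
        function occurrences(s, re,    c) {
            c = 0
            while (match(s, re)) { c++; s = substr(s, RSTART + RLENGTH) }
            return c
        }
        # Not inside a block being dropped: is this an opening item tag?
        # "." stands in for the quote character around the class name.
        !skipping && /^[ \t]*<div class=.item.>/ {
            if (++seen == n) { skipping = 1; depth = 0 }
        }
        # Inside the Nth block: track nesting until its own </div> closes it.
        skipping {
            depth += occurrences($0, "<div[ >]") - occurrences($0, "</div>")
            if (depth <= 0) skipping = 0
            next    # drop every line of the block, including the closing tag
        }
        { print }
    ' "$2"
}

# Usage (writes to stdout; redirect to a new file and move it into place):
#   delete_nth_item 3 example.xml > example.new
```

The XML-aware tools are still the safer choice whenever the input parses, because this sketch knows nothing about comments, CDATA, or tags split across lines.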
